# Community Detection

<a target="_blank" href="https://colab.research.google.com/github/neo4j/graph-data-science-client/blob/main/examples/community-detection.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

This Jupyter notebook is hosted [here](https://github.com/neo4j/graph-data-science-client/blob/main/examples/community-detection.ipynb) in the Neo4j Graph Data Science Client GitHub repository.

The notebook shows how to use the `graphdatascience` library for community detection on the Reddit Hyperlink Network dataset, which can be downloaded [here](https://snap.stanford.edu/data/soc-RedditHyperlinks.html). We will use the `soc-redditHyperlinks-body.tsv` file.

We first preprocess the graph using Weakly Connected Components and then perform community detection on the largest component using the Louvain algorithm.

### Setup

We need to import the following libraries:
- graphdatascience
- neo4j
- pandas

In [1]:
from graphdatascience import GraphDataScience
from neo4j import GraphDatabase
from neo4j.exceptions import ServiceUnavailable
import pandas as pd

In [2]:
# Replace with the actual connection URI and credentials
NEO4J_CONNECTION_URI = "bolt://54.152.132.224:7687"
NEO4J_USERNAME = "neo4j"
NEO4J_PASSWORD = "scissors-hoists-tastes"

# Client instantiation
gds = GraphDataScience(
    NEO4J_CONNECTION_URI,
    auth=(NEO4J_USERNAME, NEO4J_PASSWORD)
)

# Alternatively, read the connection details from environment variables
# (requires `import os`):
# NEO4J_URI = os.environ.get("NEO4J_URI", "bolt://localhost:7687")
# NEO4J_AUTH = None
# if os.environ.get("NEO4J_USER") and os.environ.get("NEO4J_PASSWORD"):
#     NEO4J_AUTH = (
#         os.environ.get("NEO4J_USER"),
#         os.environ.get("NEO4J_PASSWORD"),
#     )

# gds = GraphDataScience(NEO4J_URI, auth=NEO4J_AUTH)

### Importing the dataset

We first load the dataset into a pandas DataFrame. We work with only a subset of the data: the rows with a timestamp earlier than `2014-03-01 02:51:13`.

In [3]:
df = pd.read_csv('https://snap.stanford.edu/data/soc-redditHyperlinks-body.tsv', sep='\t')
df = df[df['TIMESTAMP'] < "2014-03-01 02:51:13"]
df.head()

| | SOURCE_SUBREDDIT | TARGET_SUBREDDIT | POST_ID | TIMESTAMP | LINK_SENTIMENT | PROPERTIES |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | leagueoflegends | teamredditteams | 1u4nrps | 2013-12-31 16:39:58 | 1 | 345.0,298.0,0.75652173913,0.0173913043478,0.08... |
| 1 | theredlion | soccer | 1u4qkd | 2013-12-31 18:18:37 | -1 | 101.0,98.0,0.742574257426,0.019801980198,0.049... |
| 2 | inlandempire | bikela | 1u4qlzs | 2014-01-01 14:54:35 | 1 | 85.0,85.0,0.752941176471,0.0235294117647,0.082... |
| 3 | nfl | cfb | 1u4sjvs | 2013-12-31 17:37:55 | 1 | 1124.0,949.0,0.772241992883,0.0017793594306,0.... |
| 4 | playmygame | gamedev | 1u4w5ss | 2014-01-01 02:51:13 | 1 | 715.0,622.0,0.777622377622,0.00699300699301,0.... |
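
The filter in the cell above compares the `TIMESTAMP` values as plain strings. This works because the timestamps are zero-padded and ordered from year down to second, so lexicographic order matches chronological order. The following is a small standard-library sketch of that equivalence; the sample timestamps are illustrative (three of them appear in this notebook, the last is made up).

```python
from datetime import datetime

CUTOFF = "2014-03-01 02:51:13"
FMT = "%Y-%m-%d %H:%M:%S"

# Sample timestamps in the same format as the TIMESTAMP column.
timestamps = [
    "2013-12-31 16:39:58",
    "2014-01-01 02:51:13",
    "2014-03-01 02:51:13",  # equal to the cutoff, so excluded by `<`
    "2014-03-02 00:00:00",  # made-up value after the cutoff
]

# Filter by comparing the raw strings, as the notebook cell does.
kept_as_strings = [ts for ts in timestamps if ts < CUTOFF]

# Filter by parsing each value into a datetime first.
cutoff_dt = datetime.strptime(CUTOFF, FMT)
kept_as_datetimes = [
    ts for ts in timestamps if datetime.strptime(ts, FMT) < cutoff_dt
]

print(kept_as_strings == kept_as_datetimes)  # True
print(kept_as_strings)
```

If the column used a different layout (for example day-first dates), the string comparison would no longer be safe and parsing the values first would be the more robust choice.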


The `LINK_SENTIMENT` column indicates whether the link from the source subreddit to the target subreddit is positive (+1) or negative (-1). We filter out the negative-sentiment relationships because they do not contribute to meaningful communities. We also drop duplicate relationships.

In [4]:
relationship_df = df[df['LINK_SENTIMENT'] == 1]
columns = ['SOURCE_SUBREDDIT', 'TARGET_SUBREDDIT']
relationship_df = relationship_df[columns]
relationship_df = relationship_df.drop_duplicates()
relationship_df.head()

| | SOURCE_SUBREDDIT | TARGET_SUBREDDIT |
| --- | --- | --- |
| 0 | leagueoflegends | teamredditteams |
| 2 | inlandempire | bikela |
| 3 | nfl | cfb |
| 4 | playmygame | gamedev |
| 5 | dogemarket | dogecoin |


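Next, we collect all distinct nodes (source or target subreddits) into a dataframe. Note that the nodes are taken from the full `df`, so subreddits that appear only in negative-sentiment links are included as nodes without relationships.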

In [5]:
# collect the distinct subreddit names across both columns
all_nodes = pd.concat([df['SOURCE_SUBREDDIT'], df['TARGET_SUBREDDIT']]).unique()

# create new dataframe with distinct nodes
nodes_df = pd.DataFrame({'SUBREDDIT': all_nodes})
nodes_df.head()

| | SUBREDDIT |
| --- | --- |
| 0 | leagueoflegends |
| 1 | theredlion |
| 2 | inlandempire |
| 3 | nfl |
| 4 | playmygame |


Finally, we load this data (nodes and relationships) into the Neo4j database and then project it into a GDS in-memory graph.

In [6]:
gds.run_cypher(
    "UNWIND $nodes_list AS node_props CREATE (n:Subreddit {name: node_props.SUBREDDIT})",
    params = {'nodes_list': nodes_df.to_dict('records')})

# relationship_df has only the source and target columns, so no relationship property is set
gds.run_cypher(
    """
    UNWIND $edges_list AS rel_props
    MATCH (source:Subreddit {name: rel_props.SOURCE_SUBREDDIT}),
          (target:Subreddit {name: rel_props.TARGET_SUBREDDIT})
    CREATE (source)-[:HYPERLINKED_TO]->(target)
    """,
    params = {'edges_list': relationship_df.to_dict('records')})
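
The cell above sends all rows in a single `UNWIND` query, which is fine for a dataset of this size. For larger exports, one common approach is to split the records into batches and run one query per batch. The sketch below shows only the batching step in plain Python; `chunked`, the sample records, and the batch size of 1000 in the commented usage are hypothetical and not part of the original notebook.

```python
def chunked(records, batch_size):
    """Yield successive slices of `records` with at most `batch_size` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


# Stand-in for nodes_df.to_dict('records'); the names here are made up.
sample_records = [{"SUBREDDIT": f"subreddit_{i}"} for i in range(5)]

batches = list(chunked(sample_records, batch_size=2))
print([len(batch) for batch in batches])  # [2, 2, 1]

# With a live connection, each batch could be sent separately, e.g.:
# for batch in chunked(nodes_df.to_dict('records'), 1000):
#     gds.run_cypher(
#         "UNWIND $nodes_list AS node_props CREATE (n:Subreddit {name: node_props.SUBREDDIT})",
#         params={'nodes_list': batch})
```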

In [7]:
node_projection = ["Subreddit"]
relationship_projection = {"HYPERLINKED_TO": {"orientation": "NATURAL"}}

G, result = gds.graph.project("reddit", node_projection, relationship_projection)

print(f"The projection took {result['projectMillis']} ms")

# We can use convenience methods on `G` to check if the projection looks correct
print(f"Graph '{G.name()}' node count: {G.node_count()}")
print(f"Graph '{G.name()}' node labels: {G.node_labels()}")

Loading: 100%|██████████| 100.0/100 [00:10<00:00,  9.27%/s] 


The projection took 11405 ms
Graph 'reddit' node count: 3801
Graph 'reddit' node labels: ['Subreddit']


In [8]:
gds.graph.list()

| | degreeDistribution | graphName | database | memoryUsage | sizeInBytes | nodeCount | relationshipCount | configuration | density | creationTime | modificationTime | schema | schemaWithOrientation |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | {'p99': 15, 'min': 0, 'max': 87, 'mean': 1.631... | reddit | neo4j | 876 KiB | 897448 | 3801 | 6200 | {'relationshipProjection': {'HYPERLINKED_TO': ... | 0.000429 | 2023-05-16T14:07:25.933283995+00:00 | 2023-05-16T14:07:27.112253586+00:00 | {'graphProperties': {}, 'relationships': {'HYP... | {'graphProperties': {}, 'relationships': {'HYP... |


### Weakly Connected Components

A graph need not be connected: there may be no path between some pairs of nodes, so the graph can consist of several subgraphs that are not connected to each other at all. We therefore count the nodes in each subgraph to see whether it is large enough for further analysis. Small subgraphs and isolated nodes do not contribute to the community detection task and should be eliminated. For this reason, Weakly Connected Components is often used as one of the early steps of graph preprocessing.

We use the [Weakly Connected Components](https://neo4j.com/docs/graph-data-science/2.4-preview/algorithms/wcc/) algorithm to find sets of connected nodes and assign each set a component id.

In [9]:
df = gds.wcc.mutate(G, mutateProperty='componentId')
print(df.configuration)

{'jobId': 'c7d9036d-b9a5-4d91-8d95-70bccfd67c2d', 'seedProperty': None, 'consecutiveIds': False, 'threshold': 0.0, 'logProgress': True, 'nodeLabels': ['*'], 'sudo': False, 'relationshipTypes': ['*'], 'mutateProperty': 'componentId', 'concurrency': 4}


In [10]:
G.node_properties()

Subreddit    [componentId]
dtype: object
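
To make the idea behind Weakly Connected Components concrete, here is a minimal union-find sketch on a toy edge list. It treats every relationship as undirected and groups nodes that are reachable from one another. This is an illustration of the concept only, not a description of how GDS implements `gds.wcc`; the function name and the toy nodes are made up.

```python
from collections import Counter


def weakly_connected_components(nodes, edges):
    """Map each node to a component representative, ignoring edge direction."""
    parent = {node: node for node in nodes}

    def find(node):
        # Walk up to the root, shortening the path along the way.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for source, target in edges:
        root_s, root_t = find(source), find(target)
        if root_s != root_t:
            parent[root_t] = root_s  # merge the two components

    return {node: find(node) for node in nodes}


toy_nodes = ["a", "b", "c", "d", "e", "f"]
# "c" -> "b" points the "wrong" way, but direction is ignored.
toy_edges = [("a", "b"), ("c", "b"), ("d", "e")]

components = weakly_connected_components(toy_nodes, toy_edges)
component_sizes = Counter(components.values())
print(sorted(component_sizes.values(), reverse=True))  # [3, 2, 1]
```

The isolated node `f` ends up in a component of size 1, which mirrors the many single-subreddit components in the real graph.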

Next, we look at the size of each connected component and, based on that, pick the subgraph for further analysis.

We use `run_cypher` here instead of a direct GDS client call because we want to group the streamed node properties by component and compute the size of each component in Cypher.

In [14]:
query = """
    CALL gds.graph.nodeProperties.stream('reddit', 'componentId')
    YIELD nodeId, propertyValue
    WITH nodeId as nodeId, gds.util.asNode(nodeId).name AS node, propertyValue AS componentId
    WITH componentId, collect(node) AS subreddit, size(collect(nodeId)) AS communitySize
    RETURN componentId, communitySize, subreddit
    ORDER BY communitySize DESC
"""

wcc = gds.run_cypher(query)
wcc

| | componentId | communitySize | subreddit |
| --- | --- | --- | --- |
| 0 | 0 | 3172 | [leagueoflegends, nfl, playmygame, dogemarket,... |
| 1 | 278 | 20 | [orangered, orangeredacademy, pasto_range, per... |
| 2 | 23 | 8 | [thedoctorstravels, sirron, aislynisdead, game... |
| 3 | 768 | 6 | [iracing, simracing, redditracing, team_medioc... |
| 4 | 832 | 6 | [perfumeexchange, indiemakeupandmore, asianbea... |
| ... | ... | ... | ... |
| 314 | 3712 | 1 | [aggies] |
| 315 | 3759 | 1 | [brunei] |
| 316 | 3769 | 1 | [descentintotyranny] |
| 317 | 3771 | 1 | [outofthemetaloop] |


We can see that the component with ID 0 is by far the largest, with 3172 subreddits, so we will work only with that subgraph.

In [17]:
Largest_CC, _ = gds.beta.graph.project.subgraph(
    'largest_connected_components',
    G,
    'n.componentId=0',
    '*'
)

In [18]:
Largest_CC

Graph({'graphName': 'largest_connected_components', 'nodeCount': 3172, 'relationshipCount': 5858, 'database': 'neo4j', 'configuration': {'relationshipProperties': {}, 'creationTime': neo4j.time.DateTime(2023, 5, 16, 14, 43, 14, 779680794, tzinfo=<StaticTzInfo 'Etc/UTC'>), 'validateRelationships': False, 'nodeFilter': 'n.componentId=0', 'relationshipFilter': '*', 'nodeProperties': {}, 'concurrency': 4, 'relationshipProjection': {'HYPERLINKED_TO': {'orientation': 'NATURAL', 'indexInverse': False, 'aggregation': 'DEFAULT', 'type': 'HYPERLINKED_TO', 'properties': {}}}, 'jobId': 'f3f02536-8052-4bc9-b143-ca649e16e0d0', 'nodeProjection': {'Subreddit': {'label': 'Subreddit', 'properties': {}}}, 'logProgress': True, 'readConcurrency': 4, 'sudo': False, 'parameters': {}}, 'schema': {'graphProperties': {}, 'relationships': {'HYPERLINKED_TO': {}}, 'nodes': {'Subreddit': {'componentId': 'Integer (DefaultValue(-9223372036854775808), TRANSIENT)'}}}, 'memoryUsage': '901 KiB'})
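
The node filter above hard-codes component ID 0, which happens to be the largest component in this run. A small sketch of deriving the filter from the component sizes instead is shown below. The helper name `largest_component_filter` is hypothetical, and the sample rows copy the first three rows of the `wcc` result; with the real data, the rows could come from something like `wcc.to_dict('records')`.

```python
def largest_component_filter(component_rows):
    """Build a node filter string for the component with the most nodes."""
    largest = max(component_rows, key=lambda row: row["communitySize"])
    return f"n.componentId = {largest['componentId']}"


# Rows shaped like the first entries of the `wcc` result.
sample_rows = [
    {"componentId": 0, "communitySize": 3172},
    {"componentId": 278, "communitySize": 20},
    {"componentId": 23, "communitySize": 8},
]

node_filter = largest_component_filter(sample_rows)
print(node_filter)  # n.componentId = 0
```

The resulting string could then be passed as the node filter argument in place of the literal `'n.componentId=0'`.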

### Community Detection using Louvain

We use the [Louvain](https://neo4j.com/docs/graph-data-science/2.4-preview/algorithms/louvain/) algorithm to detect communities in our subgraph and assign each node a `louvainCommunityId` that identifies its community.

In [19]:
df2 = gds.louvain.mutate(Largest_CC, mutateProperty='louvainCommunityId')
df2

Louvain: 100%|██████████| 100.0/100 [00:11<00:00,  8.74%/s]


mutateMillis                                                             4
nodePropertiesWritten                                                 3172
modularity                                                        0.587643
modularities             [0.4494090889646058, 0.5377130147763601, 0.555...
ranLevels                                                               10
communityCount                                                         300
communityDistribution    {'p99': 196, 'min': 1, 'max': 382, 'mean': 10....
postProcessingMillis                                                    17
preProcessingMillis                                                      0
computeMillis                                                        12084
configuration            {'maxIterations': 10, 'seedProperty': None, 'c...
Name: 0, dtype: object

In this run, we get a modularity score of about 0.5876 for our community detection algorithm.
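
As a rough intuition for what the modularity score measures, the sketch below computes the standard undirected, unweighted modularity, Q = Σ_c [ L_c / m − (d_c / 2m)² ], for a toy graph, where L_c is the number of edges inside community c, d_c is the total degree of its nodes, and m is the number of edges. This is an illustrative assumption about the textbook definition, not a reproduction of how GDS computes the value for the directed projection used here; the function and the toy graph are made up.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Undirected, unweighted modularity of a partition."""
    m = len(edges)
    internal_edges = defaultdict(int)  # L_c: edges with both ends in community c
    total_degree = defaultdict(int)    # d_c: sum of node degrees in community c

    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        total_degree[cu] += 1
        total_degree[cv] += 1
        if cu == cv:
            internal_edges[cu] += 1

    return sum(
        internal_edges[c] / m - (total_degree[c] / (2 * m)) ** 2
        for c in total_degree
    )


# Two triangles joined by a single bridge edge (2, 3).
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

two_communities = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
one_community = {node: "A" for node in range(6)}

print(round(modularity(toy_edges, two_communities), 4))  # 0.3571
print(round(modularity(toy_edges, one_community), 4))    # 0.0
```

Splitting the toy graph along the bridge scores higher than lumping all nodes together, which is the kind of improvement Louvain searches for.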

In [20]:
gds.graph.nodeProperties.write(Largest_CC, ["louvainCommunityId"])

writeMillis                                   578
graphName            largest_connected_components
nodeProperties               [louvainCommunityId]
propertiesWritten                            3172
Name: 0, dtype: object

We can also check that the projected graph now carries the `louvainCommunityId` node property with the command below.

In [21]:
Largest_CC.node_properties()

Subreddit    [componentId, louvainCommunityId]
dtype: object

In [23]:
query = """
    CALL gds.graph.nodeProperties.stream('largest_connected_components', 'louvainCommunityId')
    YIELD nodeId, propertyValue
    WITH nodeId as nodeId, gds.util.asNode(nodeId).name AS node, propertyValue AS communityId
    WITH communityId, collect(node) AS subreddit, size(collect(nodeId)) AS communitySize
    RETURN communityId, communitySize, subreddit
    ORDER BY communitySize DESC
"""

communities = gds.run_cypher(query)
communities

| | communityId | communitySize | subreddit |
| --- | --- | --- | --- |
| 0 | 2406 | 382 | [airsoft, bandnames, connecticut, thehiddenbar... |
| 1 | 2516 | 309 | [posthardcore, metalcore, corejerk, iama, karm... |
| 2 | 2654 | 282 | [locationbot, oldschoolcoolnsfw, uncomfortable... |
| 3 | 2676 | 196 | [playmygame, circlebroke, tribes, conspiratard... |
| 4 | 2546 | 185 | [leagueoflegends, kpop, turntablists, minecraf... |
| ... | ... | ... | ... |
| 295 | 3034 | 1 | [screenshots] |
| 296 | 3039 | 1 | [leangains] |
| 297 | 3040 | 1 | [agnostic] |
| 298 | 3043 | 1 | [mario] |


### References
S. Kumar, W.L. Hamilton, J. Leskovec, D. Jurafsky. Community Interaction and Conflict on the Web. World Wide Web Conference, 2018.