## Community Detection

This notebook shows how to use the `graphdatascience` library for community detection on the Reddit Hyperlink Network dataset, which can be downloaded [here](https://snap.stanford.edu/data/soc-RedditHyperlinks.html). We use the `soc-redditHyperlinks-body.tsv` file.

We first preprocess the graph with Weakly Connected Components and then run community detection on the largest component using the Louvain algorithm.

### Setup

We need to import the following libraries:
- graphdatascience
- neo4j
- pandas

In [1]:
from graphdatascience import GraphDataScience
from neo4j import GraphDatabase
from neo4j.exceptions import ServiceUnavailable
import pandas as pd



In [2]:
# Replace with the actual connection URI and credentials
NEO4J_CONNECTION_URI = "bolt://XXXXXXXXXXXXX"
NEO4J_USERNAME = "neo4j"
NEO4J_PASSWORD = "XXXXXXXXXXXXX"

# Client instantiation
gds = GraphDataScience(
    NEO4J_CONNECTION_URI,
    auth=(NEO4J_USERNAME, NEO4J_PASSWORD)
)

print(gds.version())

2.3.5


### Importing the dataset

We first load the dataset into a pandas dataframe. We work with a subset only: the rows whose `TIMESTAMP` is earlier than 2014-03-01 02:51:13.

In [3]:
df = pd.read_csv('soc-redditHyperlinks-body.tsv', sep='\t')
df = df[df['TIMESTAMP'] < "2014-03-01 02:51:13"]
df.head()

Unnamed: 0,SOURCE_SUBREDDIT,TARGET_SUBREDDIT,POST_ID,TIMESTAMP,LINK_SENTIMENT,PROPERTIES
0,leagueoflegends,teamredditteams,1u4nrps,2013-12-31 16:39:58,1,"345.0,298.0,0.75652173913,0.0173913043478,0.08..."
1,theredlion,soccer,1u4qkd,2013-12-31 18:18:37,-1,"101.0,98.0,0.742574257426,0.019801980198,0.049..."
2,inlandempire,bikela,1u4qlzs,2014-01-01 14:54:35,1,"85.0,85.0,0.752941176471,0.0235294117647,0.082..."
3,nfl,cfb,1u4sjvs,2013-12-31 17:37:55,1,"1124.0,949.0,0.772241992883,0.0017793594306,0...."
4,playmygame,gamedev,1u4w5ss,2014-01-01 02:51:13,1,"715.0,622.0,0.777622377622,0.00699300699301,0...."
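
The cutoff filter above compares timestamps as plain strings. The following minimal sketch illustrates why that gives the same answer as a real datetime comparison, assuming every value in the `TIMESTAMP` column uses the zero-padded `YYYY-MM-DD HH:MM:SS` layout seen in the sample rows. The helper names and the sample list are hypothetical and only serve the illustration.

```python
from datetime import datetime

CUTOFF = "2014-03-01 02:51:13"  # same cutoff string as in the filter above
FORMAT = "%Y-%m-%d %H:%M:%S"    # assumed layout of the TIMESTAMP column


def before_cutoff_as_string(timestamp: str, cutoff: str = CUTOFF) -> bool:
    """Lexicographic comparison, as done by the pandas filter."""
    return timestamp < cutoff


def before_cutoff_as_datetime(timestamp: str, cutoff: str = CUTOFF) -> bool:
    """Chronological comparison after parsing both values."""
    return datetime.strptime(timestamp, FORMAT) < datetime.strptime(cutoff, FORMAT)


# Hypothetical sample values around the cutoff
samples = [
    "2013-12-31 16:39:58",
    "2014-01-01 14:54:35",
    "2014-03-01 02:51:13",
    "2014-03-01 02:51:14",
]

# Both comparisons agree for fixed-width, zero-padded timestamps
agreement = [
    before_cutoff_as_string(ts) == before_cutoff_as_datetime(ts) for ts in samples
]
```

If the column ever mixed formats (for example, missing zero padding), the string comparison would no longer be reliable and parsing the column first would be the safer choice.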


The `LINK_SENTIMENT` column indicates whether the link from the source subreddit to the target subreddit is positive (+1) or negative (-1). We keep only the positive links, since negative links do not indicate that two subreddits belong to the same community. We also drop duplicate relationships.

In [4]:
relationship_df = df[df['LINK_SENTIMENT'] == 1]
columns = ['SOURCE_SUBREDDIT', 'TARGET_SUBREDDIT']
relationship_df = relationship_df[columns]
relationship_df = relationship_df.drop_duplicates()
relationship_df.head()

Unnamed: 0,SOURCE_SUBREDDIT,TARGET_SUBREDDIT
0,leagueoflegends,teamredditteams
2,inlandempire,bikela
3,nfl,cfb
4,playmygame,gamedev
5,dogemarket,dogecoin


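Next, we collect all distinct subreddits that appear as a source or a target and load them into a dataframe. The nodes are taken from the full subset `df`, so a subreddit that appears only in negative links ends up as an isolated node.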

In [5]:
# Collect every subreddit that appears as a source or a target, without duplicates
all_nodes = pd.concat([df['SOURCE_SUBREDDIT'], df['TARGET_SUBREDDIT']]).unique()

# create new dataframe with distinct nodes
nodes_df = pd.DataFrame({'SUBREDDIT': all_nodes})
nodes_df.head()

Unnamed: 0,SUBREDDIT
0,leagueoflegends
1,theredlion
2,inlandempire
3,nfl
4,playmygame


Finally, we load this data (nodes and relationships) into the Neo4j database and then project it into a GDS in-memory graph.

In [6]:
driver = GraphDatabase.driver(NEO4J_CONNECTION_URI,
    auth=(NEO4J_USERNAME, NEO4J_PASSWORD))

# Create nodes and relationships in the database using UNWIND
with driver.session() as session:
    # One Subreddit node per distinct subreddit name
    nodes_list = nodes_df.to_dict('records')
    session.run(
        "UNWIND $nodes_list AS node_props "
        "CREATE (n:Subreddit {node_id: node_props.SUBREDDIT, node_label: node_props.SUBREDDIT})",
        nodes_list=nodes_list,
    )

    # relationship_df has no relationship_type column, so no relationship property is set
    edges_list = relationship_df.to_dict('records')
    session.run(
        "UNWIND $edges_list AS rel_props "
        "MATCH (source:Subreddit {node_id: rel_props.SOURCE_SUBREDDIT}), (target:Subreddit {node_id: rel_props.TARGET_SUBREDDIT}) "
        "CREATE (source)-[:HYPERLINKED_TO]->(target)",
        edges_list=edges_list,
    )
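
The cell above sends all records in a single `UNWIND` call, which is fine for this subset. For a larger slice of the dataset it may help to send the records in batches. The sketch below shows one way to split a list of records. The `chunked` helper, the toy records, and the batch size are hypothetical choices and not part of the notebook.

```python
from typing import Dict, Iterator, List


def chunked(records: List[Dict], batch_size: int) -> Iterator[List[Dict]]:
    """Yield consecutive slices of `records` with at most `batch_size` items each."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


# Toy records shaped like nodes_df.to_dict('records')
toy_nodes = [{"SUBREDDIT": name} for name in ["a", "b", "c", "d", "e"]]
batches = list(chunked(toy_nodes, batch_size=2))

# Hypothetical usage inside the session block (batch size chosen arbitrarily):
# for batch in chunked(nodes_list, batch_size=1000):
#     session.run("UNWIND $nodes_list AS node_props CREATE ...", nodes_list=batch)
```

Each batch is then passed as the query parameter in place of the full list, so the Cypher statement itself does not change.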

In [7]:
node_projection = ["Subreddit"]
relationship_projection = {"HYPERLINKED_TO": {"orientation": "NATURAL"}}

G, result = gds.graph.project("reddit", node_projection, relationship_projection)

print(f"The projection took {result['projectMillis']} ms")

# We can use convenience methods on `G` to check if the projection looks correct
print(f"Graph '{G.name()}' node count: {G.node_count()}")
print(f"Graph '{G.name()}' node labels: {G.node_labels()}")

Loading: 100%|██████████| 100.0/100 [00:09<00:00, 11.02%/s] 


The projection took 9289 ms
Graph 'reddit' node count: 3801
Graph 'reddit' node labels: ['Subreddit']


In [8]:
gds.graph.list()

Unnamed: 0,degreeDistribution,graphName,database,memoryUsage,sizeInBytes,nodeCount,relationshipCount,configuration,density,creationTime,modificationTime,schema,schemaWithOrientation
0,"{'p99': 15, 'min': 0, 'max': 87, 'mean': 1.631...",reddit,neo4j,876 KiB,897448,3801,6200,{'relationshipProjection': {'HYPERLINKED_TO': ...,0.000429,2023-05-14T15:29:30.028518203+00:00,2023-05-14T15:29:30.931965175+00:00,"{'graphProperties': {}, 'relationships': {'HYP...","{'graphProperties': {}, 'relationships': {'HYP..."


### Weakly Connected Components

A graph need not be connected: there may be no path between some pairs of nodes, so the graph splits into subgraphs that are not connected to each other at all. We therefore count the nodes in each subgraph to see whether it is large enough for further analysis. Small subgraphs and isolated nodes contribute little to the community detection task and are removed here. Weakly Connected Components is often used as one of the early steps of graph preprocessing.

We use the [Weakly Connected Components](https://neo4j.com/docs/graph-data-science/2.4-preview/algorithms/wcc/) algorithm to find sets of connected nodes and assign each set a component id.
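
To make the idea concrete, here is a minimal pure-Python sketch of weakly connected components on a toy graph, using a union-find structure. It ignores edge direction, which is what "weakly" refers to. This is only an illustration and not how GDS implements the algorithm. The function name and the toy nodes are hypothetical, and the component ids GDS assigns will differ.

```python
from collections import defaultdict
from typing import Hashable, Iterable, List, Set, Tuple


def weakly_connected_components(
    nodes: Iterable[Hashable], edges: Iterable[Tuple[Hashable, Hashable]]
) -> List[Set[Hashable]]:
    """Group nodes into components, treating every directed edge as undirected."""
    parent = {node: node for node in nodes}

    def find(node):
        # Follow parent pointers to the root, shortening the path on the way
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for source, target in edges:
        parent.setdefault(source, source)
        parent.setdefault(target, target)
        root_source, root_target = find(source), find(target)
        if root_source != root_target:
            parent[root_target] = root_source  # merge the two components

    components = defaultdict(set)
    for node in parent:
        components[find(node)].add(node)
    # Largest component first
    return sorted(components.values(), key=len, reverse=True)


# Toy graph: one chain of three nodes, one pair, and one isolated node
toy_nodes = ["a", "b", "c", "d", "e", "f"]
toy_edges = [("a", "b"), ("c", "b"), ("d", "e")]
toy_components = weakly_connected_components(toy_nodes, toy_edges)
```

In the toy graph, `c -> b` and `a -> b` point in the same direction, yet `a`, `b`, and `c` still end up in one component because direction is ignored.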

In [9]:
df = gds.wcc.mutate(G, mutateProperty='componentId')
print(df.configuration)

{'jobId': 'b69faaaa-a267-444c-82b2-d11c66f9a6a4', 'seedProperty': None, 'consecutiveIds': False, 'threshold': 0.0, 'logProgress': True, 'nodeLabels': ['*'], 'sudo': False, 'relationshipTypes': ['*'], 'mutateProperty': 'componentId', 'concurrency': 4}


In [10]:
G.node_properties()

Subreddit    [componentId]
dtype: object

In [11]:
query = """
    CALL gds.wcc.stream('reddit')
    YIELD nodeId, componentId
    RETURN componentId, collect(gds.util.asNode(nodeId).node_id) AS Subreddits, size(collect(gds.util.asNode(nodeId).node_id)) AS Num_subreddits
    ORDER BY size(Subreddits) DESC
"""
wcc = gds.run_cypher(query)
wcc

Unnamed: 0,componentId,Subreddits,Num_subreddits
0,0,"[leagueoflegends, nfl, playmygame, dogemarket,...",3172
1,278,"[orangered, orangeredacademy, pasto_range, per...",20
2,23,"[thedoctorstravels, sirron, aislynisdead, game...",8
3,768,"[iracing, simracing, redditracing, team_medioc...",6
4,832,"[perfumeexchange, indiemakeupandmore, asianbea...",6
...,...,...,...
314,3712,[aggies],1
315,3759,[brunei],1
316,3769,[descentintotyranny],1
317,3771,[outofthemetaloop],1


The component with ID 0 is by far the largest, with 3172 subreddits, so we continue with that subgraph only.

In [12]:
Largest_CC, _ = gds.beta.graph.project.subgraph(
      'largest_connected_components2', 
      G,
      'n.componentId=0', 
      '*'
    )
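
The node filter above hard-codes `componentId=0`, which matches this run. In the notebook, the first row of `wcc` already holds the largest component because the query orders by size. As a sketch of how the id could be picked programmatically instead, the hypothetical helper below works on a plain node-to-component mapping. The mapping shown is toy data, not output from the dataset.

```python
from collections import Counter
from typing import Dict, Hashable


def largest_component_id(component_by_node: Dict[Hashable, int]) -> int:
    """Return the id of the component with the most nodes (smallest id wins ties)."""
    if not component_by_node:
        raise ValueError("no nodes given")
    sizes = Counter(component_by_node.values())
    component_id, _ = max(sizes.items(), key=lambda item: (item[1], -item[0]))
    return component_id


# Toy mapping of node name -> component id
toy_mapping = {"a": 0, "b": 0, "c": 0, "d": 278, "e": 278, "f": 23}
best_id = largest_component_id(toy_mapping)

# Build the node filter string from the id instead of hard-coding it
node_filter = f"n.componentId = {best_id}"
```

The resulting string could then be passed as the node filter argument of the subgraph projection.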

In [13]:
Largest_CC

Graph({'graphName': 'largest_connected_components2', 'nodeCount': 3172, 'relationshipCount': 5858, 'database': 'neo4j', 'configuration': {'relationshipProperties': {}, 'creationTime': neo4j.time.DateTime(2023, 5, 14, 15, 29, 52, 126057108, tzinfo=<StaticTzInfo 'Etc/UTC'>), 'validateRelationships': False, 'nodeFilter': 'n.componentId=0', 'relationshipFilter': '*', 'nodeProperties': {}, 'concurrency': 4, 'relationshipProjection': {'HYPERLINKED_TO': {'orientation': 'NATURAL', 'indexInverse': False, 'aggregation': 'DEFAULT', 'type': 'HYPERLINKED_TO', 'properties': {}}}, 'jobId': 'e1d3750e-61f5-4928-b16c-4f5f566e09f1', 'nodeProjection': {'Subreddit': {'label': 'Subreddit', 'properties': {}}}, 'logProgress': True, 'readConcurrency': 4, 'sudo': False, 'parameters': {}}, 'schema': {'graphProperties': {}, 'relationships': {'HYPERLINKED_TO': {}}, 'nodes': {'Subreddit': {'componentId': 'Integer (DefaultValue(-9223372036854775808), TRANSIENT)'}}}, 'memoryUsage': '901 KiB'})

### Community Detection using Louvain

We use the [Louvain](https://neo4j.com/docs/graph-data-science/2.4-preview/algorithms/louvain/) algorithm to detect communities in our subgraph and store each node's community ID in the `louvainCommunityId` node property.

In [14]:
df2 = gds.louvain.mutate(Largest_CC, mutateProperty='louvainCommunityId')
df2

Louvain: 100%|██████████| 100.0/100 [00:12<00:00,  7.95%/s]


mutateMillis                                                             0
nodePropertiesWritten                                                 3172
modularity                                                         0.58988
modularities             [0.4494089141198883, 0.5373675216145954, 0.555...
ranLevels                                                               10
communityCount                                                         300
communityDistribution    {'p99': 196, 'min': 1, 'max': 382, 'mean': 10....
postProcessingMillis                                                    22
preProcessingMillis                                                      1
computeMillis                                                        12974
configuration            {'maxIterations': 10, 'seedProperty': None, 'c...
Name: 0, dtype: object

The resulting partition has a modularity score of about 0.59.
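
Modularity compares the share of edges that fall inside communities with the share expected if edges were placed at random while keeping node degrees fixed. The sketch below computes the standard undirected, unweighted form of the score on a toy graph. It is an illustration only: the function name and toy data are hypothetical, and the exact variant GDS uses for a directed projection may differ, so the numbers are not directly comparable with the output above.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple


def modularity(
    edges: List[Tuple[Hashable, Hashable]], community_by_node: Dict[Hashable, int]
) -> float:
    """Undirected, unweighted modularity: sum over c of L_c/m - (d_c / 2m)^2."""
    m = len(edges)
    if m == 0:
        raise ValueError("graph has no edges")
    internal_edges = defaultdict(int)  # L_c: edges with both ends in community c
    degree_sum = defaultdict(int)      # d_c: total degree of nodes in community c
    for u, v in edges:
        cu, cv = community_by_node[u], community_by_node[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal_edges[cu] += 1
    return sum(
        internal_edges[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum
    )


# Toy graph: two triangles joined by a single bridge edge
toy_edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("d", "e"), ("e", "f"), ("d", "f"),
    ("c", "d"),
]
two_communities = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_community = {node: 0 for node in two_communities}

q_split = modularity(toy_edges, two_communities)
q_single = modularity(toy_edges, one_community)
```

Splitting the toy graph into its two triangles scores higher than putting every node into one community, which scores exactly zero under this formula.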

In [15]:
df2.modularity

0.5898798012505129

In [16]:
Largest_CC.node_properties()

Subreddit    [componentId, louvainCommunityId]
dtype: object

In [17]:
query = """
    CALL gds.louvain.write('largest_connected_components2', { writeProperty: 'louvainCommunityId' })
    YIELD communityCount, modularity, modularities
"""
communities = gds.run_cypher(query)
communities

Unnamed: 0,communityCount,modularity,modularities
0,300,0.58988,"[0.4494089141198883, 0.5373675216145954, 0.555..."


In [18]:
Largest_CC.node_properties()

Subreddit    [componentId, louvainCommunityId]
dtype: object

In [19]:
query = """
    CALL gds.louvain.stream('largest_connected_components2')
    YIELD nodeId, communityId, intermediateCommunityIds
    RETURN collect(gds.util.asNode(nodeId).node_id) AS Subreddits, communityId
    ORDER BY size(Subreddits) DESC
"""
louvain_communities = gds.run_cypher(query)
louvain_communities

Unnamed: 0,Subreddits,communityId
0,"[airsoft, bandnames, connecticut, thehiddenbar...",2406
1,"[posthardcore, metalcore, corejerk, iama, karm...",2612
2,"[locationbot, oldschoolcoolnsfw, uncomfortable...",2579
3,"[playmygame, circlebroke, tribes, conspiratard...",2676
4,"[radioreddit, autism, modhelp, digital_immorta...",3158
...,...,...
295,[banishedmaps],3032
296,[screenshots],3034
297,[leangains],3039
298,[agnostic],3040
