# Static Community Detection Using CDLIB and NetworkX - JUST

After calculating similarities and updating the edge list with those values, we run the well-known Louvain and Leiden community detection algorithms using the NetworkX library and CDlib.

## Importing Edge List w/ Weights to NetworkX

NetworkX's `read_weighted_edgelist` function expects a plain text file with lines of the form `<node1> <node2> <weight>` and no header row. Since our data is a CSV file with a header, we load it with pandas first and build the graph from the resulting DataFrame.
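If you would rather produce a file in the whitespace-separated format itself, the CSV rows have to be converted first. The following is a minimal sketch using only the standard library. It assumes the column names `source`, `target`, and `weight` that appear in the printed DataFrame, and `csv_to_edgelist_lines` is a hypothetical helper name, not part of NetworkX or pandas.

```python
import csv
import io


def csv_to_edgelist_lines(csv_text):
    """Turn CSV text with a source,target,weight header into '<node1> <node2> <weight>' lines."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # DictReader consumes the header row, so only data rows are emitted.
    return [f"{row['source']} {row['target']} {row['weight']}" for row in reader]


# Two rows taken from the edge list shown in this notebook.
sample_csv = "source,target,weight\np0,a1,0.810501\np0,a2,0.825994\n"
edge_lines = csv_to_edgelist_lines(sample_csv)
print(edge_lines)
```

Writing these lines to a text file, one per line, gives the headerless layout described above.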

In [1]:
import pandas as pd

edge_list_df = pd.read_csv('New Input/JUST_edge_list_with_similarity.csv')

print(edge_list_df)

       source  target    weight
0          p0      a1  0.810501
1          p0      a2  0.825994
2          p0      a3  0.846090
3          p0      a4  0.766287
4          p0      a5  0.839626
...       ...     ...       ...
51372  p10169   p8094  0.586715
51373  p10169   p7974  0.669212
51374  p10169   p5852  0.599799
51375  p10169  p10113  0.639972
51376  p10169  p10031  0.488926

[51377 rows x 3 columns]


In [2]:
negative_weights = edge_list_df[edge_list_df['weight'] < 0]
print(f"Number of edges with negative weights: {len(negative_weights)}")

Number of edges with negative weights: 21


Louvain is not designed to handle negative edge weights, so we rescale the weights linearly from the range [-1, 1] to [0, 1]. After rescaling, 0 represents perfect dissimilarity, 0.5 represents orthogonality, and 1 represents perfect similarity.

In [3]:
edge_list_df['weight'] = (edge_list_df['weight'] + 1) / 2

print(edge_list_df)

       source  target    weight
0          p0      a1  0.905251
1          p0      a2  0.912997
2          p0      a3  0.923045
3          p0      a4  0.883144
4          p0      a5  0.919813
...       ...     ...       ...
51372  p10169   p8094  0.793358
51373  p10169   p7974  0.834606
51374  p10169   p5852  0.799899
51375  p10169  p10113  0.819986
51376  p10169  p10031  0.744463

[51377 rows x 3 columns]
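As a quick sanity check on the rescaling, the sketch below applies the same formula to the three reference points and to the first weight in the edge list. `rescale_similarity` is a hypothetical helper name used only for illustration.

```python
def rescale_similarity(weight):
    """Map a similarity in [-1, 1] linearly onto [0, 1]."""
    return (weight + 1) / 2


# Reference points: perfect dissimilarity, orthogonality, perfect similarity.
for w in (-1.0, 0.0, 1.0):
    print(w, "->", rescale_similarity(w))

# First row of the edge list: 0.810501 before rescaling.
first_weight = rescale_similarity(0.810501)
print(round(first_weight, 6))
```

Because the mapping is linear and increasing, the relative ordering of edge weights is preserved.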


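Before creating the graph, we check the edge list for self-loops, parallel edges (A->B and B->A), and duplicate edges. A simple undirected `nx.Graph` stores at most one edge per node pair, so repeated pairs would silently overwrite each other's weight, and the self-loops here appear to record only a node's similarity to itself.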

In [5]:
duplicate_edges = edge_list_df.duplicated(subset=['source', 'target'], keep=False)
print(f"Number of duplicate edges: {duplicate_edges.sum()}")

self_loops = edge_list_df[edge_list_df['source'] == edge_list_df['target']]
print(f"Number of self-loops: {len(self_loops)}")
print(self_loops)

print(edge_list_df.isnull().sum())

Number of duplicate edges: 0
Number of self-loops: 4
      source target  weight
2030    p661   p661     1.0
18335  p4680  p4680     1.0
25834  p6225  p6225     1.0
45560  p9359  p9359     1.0
source    0
target    0
weight    0
dtype: int64


In [6]:
print(f"Number of edges before dropping self-loops: {len(edge_list_df)}")

edge_list_df = edge_list_df[edge_list_df['source'] != edge_list_df['target']]

print(f"Number of edges after dropping self-loops: {len(edge_list_df)}")

Number of edges before dropping self-loops: 51377
Number of edges after dropping self-loops: 51373


In [7]:
# Find duplicate edges (ignoring the weight column)
duplicate_edges = edge_list_df.duplicated(subset=['source', 'target'], keep=False)

# Filter to get only the duplicate edges
parallel_edges_df = edge_list_df[duplicate_edges]

# Sort to better visualize parallel edges
parallel_edges_sorted = parallel_edges_df.sort_values(by=['source', 'target'])

print(parallel_edges_sorted)

Empty DataFrame
Columns: [source, target, weight]
Index: []
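Note that `duplicated(subset=['source', 'target'])` compares the pairs in the order they are stored, so a reversed pair such as (A, B) and (B, A) would not be flagged by the checks above. The following is a minimal sketch of an order-independent check using only the standard library. `find_reversed_duplicates` is a hypothetical helper, and the sample edges are made up for illustration.

```python
from collections import Counter


def find_reversed_duplicates(edges):
    """Return undirected node pairs that occur more than once, regardless of direction."""
    # Sorting each pair makes (A, B) and (B, A) share the same key.
    counts = Counter(tuple(sorted((source, target))) for source, target in edges)
    return sorted(pair for pair, n in counts.items() if n > 1)


sample_edges = [("p0", "a1"), ("a1", "p0"), ("p0", "a2")]
repeated = find_reversed_duplicates(sample_edges)
print(repeated)
```

On the real data, the (source, target) columns of `edge_list_df` could be passed in as tuples. An empty result would confirm that no parallel edges exist in either direction.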


## Creating Undirected Weighted Graph

We iterate over the edge list DataFrame rows to add edges along with their weights to a new NetworkX graph.


In [8]:
import networkx as nx

def get_graph_info(graph):
    print("Number of nodes:", graph.number_of_nodes())
    print("Number of edges:", graph.number_of_edges())
    
    # Checking the graph type to provide appropriate information
    if isinstance(graph, nx.DiGraph):
        print("Graph is Directed")
    else:
        print("Graph is Undirected")


In [9]:
# Initialize a new undirected graph (a MultiGraph stores parallel edges separately)
G = nx.MultiGraph()

# Add edges and weights
for index, row in edge_list_df.iterrows():
    source = row['source']
    target = row['target']
    weight = row['weight']
    
    # Add the edge with weight
    G.add_edge(source, target, weight=weight)

In [10]:
get_graph_info(G)

Number of nodes: 15649
Number of edges: 51373
Graph is Undirected


## Running Louvain Using CDLIB

CDlib (Community Discovery Library) is designed for community detection and analysis, providing easy access to various algorithms, including Louvain and Leiden, and tools for evaluating and visualizing the results.

In [11]:
from cdlib import algorithms


Note: to be able to use all crisp methods, you need to install some additional packages:  {'bayanpy', 'wurlitzer', 'graph_tool', 'infomap'}
Note: to be able to use all crisp methods, you need to install some additional packages:  {'pyclustering', 'ASLPAw'}
Note: to be able to use all crisp methods, you need to install some additional packages:  {'wurlitzer', 'infomap'}


In [12]:
communities_louvain = algorithms.louvain(G)

# Accessing the number of communities/partitions
num_partitions = len(communities_louvain.communities)
print(f"Number of partitions: {num_partitions}")

# Accessing modularity
modularity = communities_louvain.newman_girvan_modularity().score
print(f"Modularity: {modularity}")

Number of partitions: 1697
Modularity: 0.6216423331358831
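To make the modularity score more concrete, here is a minimal sketch of Newman-Girvan modularity for an unweighted, undirected toy graph, written with the standard library only. It is an illustration of the formula, not CDlib's implementation, and it ignores edge weights. The `modularity` function and the toy graph are assumptions made for this example.

```python
def modularity(edges, communities):
    """Newman-Girvan modularity of a partition of an unweighted, undirected graph.

    Q = sum over communities c of [ L_c / m - (d_c / (2 m))^2 ],
    where m is the number of edges, L_c the number of edges inside c,
    and d_c the sum of the degrees of the nodes in c.
    """
    m = len(edges)
    membership = {node: cid for cid, nodes in enumerate(communities) for node in nodes}
    internal = [0] * len(communities)
    degree_sum = [0] * len(communities)
    for u, v in edges:
        degree_sum[membership[u]] += 1
        degree_sum[membership[v]] += 1
        if membership[u] == membership[v]:
            internal[membership[u]] += 1
    return sum(inside / m - (deg / (2 * m)) ** 2 for inside, deg in zip(internal, degree_sum))


# Two triangles joined by a single bridge edge (c-d).
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("d", "e"), ("e", "f"), ("d", "f"),
             ("c", "d")]
q_split = modularity(toy_edges, [{"a", "b", "c"}, {"d", "e", "f"}])
q_single = modularity(toy_edges, [{"a", "b", "c", "d", "e", "f"}])
print(q_split, q_single)  # q_split is 5/14 (about 0.357), q_single is 0
```

Splitting the toy graph into its two triangles scores higher than putting every node into one community, which is the behaviour both Louvain and Leiden try to exploit.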


## Running Leiden Using CDLIB

In [13]:
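# `G` is the NetworkX graph built above, with non-negative weights.
# No `weights` argument is passed here; check the default in your CDlib version
# to confirm whether edge weights are actually used by this call.
communities_leiden = algorithms.leiden(G)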

# Accessing the number of communities and other properties
num_partitions = len(communities_leiden.communities)
print(f"Number of partitions: {num_partitions}")

# Accessing modularity
modularity = communities_leiden.newman_girvan_modularity().score
print(f"Modularity: {modularity}")

Number of partitions: 23
Modularity: 0.7399579052237379


## Saving Node List w/ Community Assignments

To visualize the partitions, we iterate through each partition and assign a community ID to every node in it. We can then color-code the nodes by community to see which ones were grouped together.
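The core of this step is flattening a list of communities into a node-to-community lookup. The following is a minimal sketch with the standard library, leaving out the colors. `assign_community_ids` and the toy communities are hypothetical. The -1 fallback mirrors the default used later in this notebook for nodes that were not assigned to any community.

```python
def assign_community_ids(communities):
    """Map each node ID (as a string) to the index of the community that contains it."""
    node_to_community = {}
    for community_id, community_nodes in enumerate(communities):
        for node in community_nodes:
            node_to_community[str(node)] = community_id
    return node_to_community


toy_communities = [["p0", "a1", "a2"], ["p1", "v3"]]
mapping = assign_community_ids(toy_communities)

# Nodes missing from every community fall back to -1.
print(mapping.get("a2"), mapping.get("v3"), mapping.get("t9", -1))
```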

### Load Node ID List

This file was copied from the Similar+Weights folder.

In [14]:
import pandas as pd

# Load node IDs into a DataFrame
node_df = pd.read_csv('New Input/JUST_node_indexes.txt', header=None, names=['nodeID'])

display(node_df)

       nodeID
0      t10503
1      v10189
2      v10187
3      v10181
4      v10188
...       ...
15644  t11912
15645  p15275
15646   a2418
15647   a2419


In [16]:
def get_type(node_id):
    if node_id.startswith('t'):
        return 'topic'
    elif node_id.startswith('a'):
        return 'author'
    elif node_id.startswith('v'):
        return 'venue'
    elif node_id.startswith('p'):
        return 'paper'
    else:
        return 'unknown'

node_df['type'] = node_df['nodeID'].apply(get_type)

display(node_df)

       nodeID    type
0      t10503   topic
1      v10189   venue
2      v10187   venue
3      v10181   venue
4      v10188   venue
...       ...     ...
15644  t11912   topic
15645  p15275   paper
15646   a2418  author
15647   a2419  author


In [17]:
import matplotlib.cm as cm
import matplotlib

### For Louvain

In [18]:
# Correctly accessing the communities for iteration
n_communities = len(communities_louvain.communities)
colors = cm.get_cmap('viridis', n_communities)

# Initialize the mapping dictionary
node_community_color_map = {}

for community_id, community_nodes in enumerate(communities_louvain.communities):
    color = colors(community_id / n_communities)  # Get a color from the colormap
    color_hex = matplotlib.colors.rgb2hex(color)  # Convert the color to hex format
    
    for node in community_nodes:
        node_community_color_map[str(node)] = {"communityID": community_id, "color": color_hex}

In [19]:
node_df_louvain = node_df.copy()

# Add community ID and color to the DataFrame (type is already present)
node_df_louvain['communityID'] = node_df['nodeID'].apply(lambda x: node_community_color_map[x]['communityID'] if x in node_community_color_map else -1)
node_df_louvain['color'] = node_df['nodeID'].apply(lambda x: node_community_color_map[x]['color'] if x in node_community_color_map else '#000000')

display(node_df_louvain)

       nodeID    type  communityID    color
0      t10503   topic            1  #440154
1      v10189   venue            1  #440154
2      v10187   venue            0  #440154
3      v10181   venue          347  #404588
4      v10188   venue            1  #440154
...       ...     ...          ...      ...
15644  t11912   topic          131  #481c6e
15645  p15275   paper            4  #440154
15646   a2418  author          131  #481c6e
15647   a2419  author          131  #481c6e


In [20]:
# Select relevant columns if necessary and export to CSV
node_df_louvain[['nodeID', 'communityID', 'color', 'type']].to_csv('Output/node_metadata_JUST_Louvain.csv', index=False, sep=';')

### For Leiden

In [21]:

# Correctly accessing the communities for iteration
n_communities = len(communities_leiden.communities)
colors = cm.get_cmap('viridis', n_communities)

# Initialize the mapping dictionary
node_community_color_map = {}

for community_id, community_nodes in enumerate(communities_leiden.communities):
    color = colors(community_id / n_communities)  # Get a color from the colormap
    color_hex = matplotlib.colors.rgb2hex(color)  # Convert the color to hex format
    
    for node in community_nodes:
        node_community_color_map[str(node)] = {"communityID": community_id, "color": color_hex}

In [22]:
node_df_leiden = node_df.copy()

# Add community ID and color to the DataFrame (type is already present)
node_df_leiden['communityID'] = node_df['nodeID'].apply(lambda x: node_community_color_map[x]['communityID'] if x in node_community_color_map else -1)
node_df_leiden['color'] = node_df['nodeID'].apply(lambda x: node_community_color_map[x]['color'] if x in node_community_color_map else '#000000')

display(node_df_leiden)

       nodeID    type  communityID    color
0      t10503   topic            8  #2d708e
1      v10189   venue            4  #433e85
2      v10187   venue            1  #471164
3      v10181   venue            0  #440154
4      v10188   venue            4  #433e85
...       ...     ...          ...      ...
15644  t11912   topic            7  #32648e
15645  p15275   paper           10  #25858e
15646   a2418  author            7  #32648e
15647   a2419  author            7  #32648e


In [23]:
# Select relevant columns if necessary and export to CSV
node_df_leiden[['nodeID', 'communityID', 'color', 'type']].to_csv('Output/node_metadata_JUST_Leiden.csv', index=False, sep=';')

## Exploring Community Assignments

We would also like to explore how the Leiden and Louvain communities differ, check whether they make sense by looking at their content, and investigate why Leiden produces significantly fewer communities than Louvain.
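One way to compare the two partitions directly is to match each community in the coarser partition with the community in the finer one that overlaps it most. The following is a minimal sketch using Jaccard overlap and the standard library only. `best_matches` and the toy partitions are hypothetical and stand in for `communities_leiden.communities` and `communities_louvain.communities`.

```python
def best_matches(partition_a, partition_b):
    """For each community in partition_a, find the community in partition_b
    with the highest Jaccard overlap. Returns (index_a, index_b, score) tuples."""
    matches = []
    for idx_a, nodes_a in enumerate(partition_a):
        set_a = set(nodes_a)
        best_idx, best_score = None, 0.0
        for idx_b, nodes_b in enumerate(partition_b):
            set_b = set(nodes_b)
            union = set_a | set_b
            score = len(set_a & set_b) / len(union) if union else 0.0
            # Strict comparison keeps the first community on ties.
            if score > best_score:
                best_idx, best_score = idx_b, score
        matches.append((idx_a, best_idx, best_score))
    return matches


coarse = [["n1", "n2", "n3", "n4"], ["n5", "n6"]]
fine = [["n1", "n2"], ["n3", "n4"], ["n5", "n6"]]
result = best_matches(coarse, fine)
print(result)
```

A low best score for a large community suggests that the other algorithm split it into several smaller pieces rather than finding a different structure altogether.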

### Import Original Node Information

First, we import the original dataset information, including tag and artist names as well as user IDs. This lets us check whether an assignment makes sense by looking at the actual content of its nodes, for example whether "Death Metal" and "Pop" tags end up in different communities.

In [20]:
tags_df = pd.read_csv('Extra Input/tags.dat', sep='\t', encoding='ISO-8859-1')
artists_df = pd.read_csv('Extra Input/artists.dat', sep='\t')
users_df = pd.read_csv('Extra Input/user_friends.dat', sep='\t')

artists_df = artists_df.drop(['url', 'pictureURL'], axis=1)

display(tags_df)
display(artists_df)
display(users_df)

       tagID           tagValue
0          1              metal
1          2  alternative metal
2          3          goth rock
3          4        black metal
4          5        death metal
...      ...                ...
11941  12644              suomi
11942  12645          symbiosis
11943  12646            sverige
11944  12647               eire


          id               name
0          1       MALICE MIZER
1          2    Diary of Dreams
2          3  Carpathian Forest
3          4       Moi dix Mois
4          5        Bella Morte
...      ...                ...
17627  18741     Diamanda Galás
17628  18742             Aya RL
17629  18743        Coptic Rain
17630  18744       Oz Alchemist


       userID  friendID
0           2       275
1           2       428
2           2       515
3           2       761
4           2       831
...       ...       ...
25429    2099      1801
25430    2099      2006
25431    2099      2016
25432    2100       586


To add the names of tags and artists to our main dataframes, we create dictionaries that help us directly map ID to name. We also define a function that adds the corresponding name to a new column based on the ID.

In [21]:
tag_dict = pd.Series(tags_df.tagValue.values, index=tags_df.tagID).to_dict()
artist_dict = pd.Series(artists_df.name.values, index=artists_df.id).to_dict()

def id_to_name(node_id):
    prefix = node_id[0]
    # Adjust this line based on the actual format of your node IDs
    numeric_part = node_id.split('_')[-1]  # This splits at the underscore and takes the last part
    
    try:
        numeric_id = int(numeric_part)
    except ValueError:
        return "Invalid ID format"
    
    if prefix == 't':
        return tag_dict.get(numeric_id, "Unknown Tag")
    elif prefix == 'a':
        return artist_dict.get(numeric_id, "Unknown Artist")
    return "Invalid prefix"
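The function above assumes node IDs of the form `t_24`. IDs without an underscore, such as `t10503` in the node list loaded earlier, would fail the integer conversion and come back as "Invalid ID format". The following is a minimal sketch of a parser that accepts both layouts. `split_node_id` and the pattern are assumptions for illustration, not part of the original pipeline.

```python
import re

# One letter, an optional underscore, then digits: matches 't_24' and 't10503'.
NODE_ID_PATTERN = re.compile(r"([A-Za-z])_?(\d+)")


def split_node_id(node_id):
    """Return (prefix, numeric_id) for IDs such as 't_24' or 't10503', else None."""
    match = NODE_ID_PATTERN.fullmatch(node_id)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


print(split_node_id("t_24"))
print(split_node_id("t10503"))
print(split_node_id("bad-id"))
```

The returned prefix and number could then be looked up in `tag_dict` or `artist_dict` in the same way as `id_to_name` does.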



In [22]:
node_df_louvain['name'] = node_df_louvain['nodeID'].apply(id_to_name)
node_df_leiden['name'] = node_df_leiden['nodeID'].apply(id_to_name)

In [23]:
display(node_df_louvain)
display(node_df_leiden)

        nodeID    type  communityID    color                             name
0         t_24     tag            4  #440256                              pop
1         t_73     tag            4  #440256                             rock
2         t_18     tag            1  #440154                       electronic
3         t_81     tag            0  #440154                            indie
4         t_79     tag            0  #440154                      alternative
...        ...     ...          ...      ...                              ...
21513    t_403     tag            3  #440256  michael-jackson-the-king-of-pop
21514  a_16380  artist            2  #440154               Beyond The Embrace
21515   a_1637  artist          112  #3f4788               The Grouch & Eligh
21516   a_1635  artist          112  #3f4788                       The D.O.C.


        nodeID    type  communityID    color                             name
0         t_24     tag            1  #481b6d                              pop
1         t_73     tag            0  #440154                             rock
2         t_18     tag            0  #440154                       electronic
3         t_81     tag            0  #440154                            indie
4         t_79     tag            0  #440154                      alternative
...        ...     ...          ...      ...                              ...
21513    t_403     tag            3  #3f4788  michael-jackson-the-king-of-pop
21514  a_16380  artist            2  #46327e               Beyond The Embrace
21515   a_1637  artist            6  #277f8e               The Grouch & Eligh
21516   a_1635  artist            6  #277f8e                       The D.O.C.


### For Louvain

#### Overview of Community Data

First, we aggregate the data by community ID to get an overview of each community's composition. We group our DataFrame by the community ID and then examine the types and names within each community.


In [24]:
# Group by community ID and list out members of each community
community_groups = node_df_louvain.groupby('communityID')

# Example to print out the composition of each community
for community_id, group in community_groups:
    print(f"Community ID: {community_id}")
    print(f"Members count: {len(group)}")
    print(group[['nodeID', 'name', 'type']].head())  # Adjust as needed
    print("\n")


Community ID: 0
Members count: 3387
  nodeID              name type
3   t_81             indie  tag
4   t_79       alternative  tag
6  t_292              folk  tag
8   t_78  alternative rock  tag
9   t_84        indie rock  tag


Community ID: 1
Members count: 3385
   nodeID          name type
2    t_18    electronic  tag
5    t_33  experimental  tag
7    t_14       ambient  tag
11   t_13      chillout  tag
13  t_187   electronica  tag


Community ID: 2
Members count: 3115
   nodeID         name type
17    t_1        metal  tag
18  t_376    metalcore  tag
20   t_72    hard rock  tag
28  t_386  heavy metal  tag
33    t_5  death metal  tag


Community ID: 3
Members count: 2406
   nodeID        name type
19  t_625        soul  tag
45  t_308       latin  tag
55  t_512    romantic  tag
61  t_199  electropop  tag
88  t_349    teen pop  tag


Community ID: 4
Members count: 2280
   nodeID      name type
0    t_24       pop  tag
1    t_73      rock  tag
12   t_25       80s  tag
15  t_352       

We can also generate a summary table that lists each community ID along with the counts of tags, artists, and users in each community.

In [47]:
community_summary_louvain = node_df_louvain.groupby('communityID')['type'].value_counts().unstack(fill_value=0)

# Calculate the total number of entities in each community
community_summary_louvain['total_entities'] = community_summary_louvain.sum(axis=1)

# Reset the index to turn the index into a column
community_summary_louvain.reset_index(inplace=True)

# pd.set_option('display.max_rows', None)
# pd.set_option('display.max_columns', None)

pd.reset_option('display.max_rows')
pd.reset_option('display.max_columns')


# Display the summary table
community_summary_louvain


type  communityID  artist  tag  user  total_entities
0               0    2666  272   449            3387
1               1    2981  262   142            3385
2               2    2388  401   326            3115
3               3    1711  189   506            2406
4               4    1883  228   169            2280
...           ...     ...  ...   ...             ...
522           522       1    1     0               2
523           523       1    0     1               2
524           524       1    1     0               2
525           525       1    1     0               2
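The summary above is a cross-tabulation of community ID against node type. For readers less familiar with `groupby(...).value_counts().unstack()`, the same counting logic can be sketched with the standard library. `summarize_communities` and the toy rows are hypothetical and only illustrate the idea.

```python
from collections import Counter, defaultdict


def summarize_communities(rows):
    """rows: iterable of (community_id, node_type) pairs.
    Returns {community_id: Counter mapping node type to count}."""
    summary = defaultdict(Counter)
    for community_id, node_type in rows:
        summary[community_id][node_type] += 1
    return dict(summary)


toy_rows = [(0, "tag"), (0, "artist"), (0, "artist"), (1, "user"), (1, "artist")]
summary = summarize_communities(toy_rows)

for community_id in sorted(summary):
    counts = summary[community_id]
    # A Counter returns 0 for types that never occur, like fill_value=0 in unstack.
    print(community_id, counts["artist"], counts["tag"], counts["user"], sum(counts.values()))
```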


#### Specific Community Analysis

In [82]:
community = node_df_louvain[node_df_louvain['communityID'] == 26]
# Count the occurrence of each type within the community
print(community['type'].value_counts())
# List all unique tags or artists within the community
print("Unique tags:", community[community['type'] == 'tag']['name'].unique())
print("Unique artists:", community[community['type'] == 'artist']['name'].unique())

type
artist    37
user       1
Name: count, dtype: int64
Unique tags: []
Unique artists: ['Turkish Military Band' 'Djelimady Tounkara' 'Talib al-Habib'
 'حسين الاعظمي' 'Houria Aïchi' 'grup genç' 'Ahmet Çalışır'
 'Abu Ali & Abu Abdul Maalik' 'Yusuf Islam' 'Shabnam Majid' 'DEBU'
 'Al-Aseef' 'Ahmed Bukhatir' 'Salif Keita' 'عماد الشريف' 'محمد منير'
 'Hasan Dursun' 'ميس شلش' 'Sajid Raza Qadri' 'Nass Marrakech' 'Noor'
 'Imad Al-Sharif' 'Shaam' 'Irfan Makki'
 'Mustafa Demirci & Murat Necipoğlu' 'Ali Farka Touré and Ry Cooder'
 'Ibrahim Jibreen' 'Abu Ali' 'Ottoman Military Band' 'Hadiqa Kiani'
 'Nusrat Fateh Ali Khan & Party' 'بوب مارلي' 'Firqat al-Isra'
 'Grup Yeniçağ' 'Maher Zain' "Hussein al A'dhami" 'Abdullah Rolle']


### For Leiden

#### Overview of Community Data

First, we aggregate the data by community ID to get an overview of each community's composition. We group our DataFrame by the community ID and then examine the types and names within each community.


In [32]:
# Group by community ID and list out members of each community
community_groups = node_df_leiden.groupby('communityID')

# Example to print out the composition of each community
for community_id, group in community_groups:
    print(f"Community ID: {community_id}")
    print(f"Members count: {len(group)}")
    print(group[['nodeID', 'name', 'type']].head())  # Adjust as needed
    print("\n")


Community ID: 0
Members count: 4650
  nodeID          name type
1   t_73          rock  tag
2   t_18    electronic  tag
3   t_81         indie  tag
4   t_79   alternative  tag
5   t_33  experimental  tag


Community ID: 1
Members count: 2862
   nodeID      name type
0    t_24       pop  tag
12   t_25       80s  tag
15  t_352       90s  tag
19  t_625      soul  tag
26   t_74  synthpop  tag


Community ID: 2
Members count: 2710
   nodeID              name type
17    t_1             metal  tag
20   t_72         hard rock  tag
24  t_387  progressive rock  tag
28  t_386       heavy metal  tag
33    t_5       death metal  tag


Community ID: 3
Members count: 2666
   nodeID        name type
25  t_109    pop rock  tag
45  t_308       latin  tag
55  t_512    romantic  tag
61  t_199  electropop  tag
88  t_349    teen pop  tag


Community ID: 4
Members count: 2603
   nodeID       name type
14  t_181       punk  tag
18  t_376  metalcore  tag
29  t_182  punk rock  tag
47  t_169   pop punk  tag
56  

We can also generate a summary table that lists each community ID along with the counts of tags, artists, and users in each community.

In [48]:
community_summary_leiden = node_df_leiden.groupby('communityID')['type'].value_counts().unstack(fill_value=0)

# Calculate the total number of entities in each community
community_summary_leiden['total_entities'] = community_summary_leiden.sum(axis=1)

# Reset the index to turn the index into a column
community_summary_leiden.reset_index(inplace=True)

# Display the summary table
community_summary_leiden

type  communityID  artist  tag  user  total_entities
0               0    4075  347   228            4650
1               1    2402  315   145            2862
2               2    2105  388   217            2710
3               3    1923  205   538            2666
4               4    2136  210   257            2603
5               5    1721  208   374            2303
6               6    1022   98    46            1166
7               7     628   41    25             694
8               8     606   64    16             686
9               9     514   49    25             588


#### Specific Community Analysis

In [52]:
community = node_df_leiden[node_df_leiden['communityID'] == 0]
# Count the occurrence of each type within the community
print(community['type'].value_counts())
# List all unique tags or artists within the community
print("Unique tags:", community[community['type'] == 'tag']['name'].unique())
print("Unique artists:", community[community['type'] == 'artist']['name'].unique())

type
artist    4075
tag        347
user       228
Name: count, dtype: int64
Unique tags: ['rock' 'electronic' 'indie' 'alternative' 'experimental' 'folk' 'ambient'
 'alternative rock' 'indie rock' 'chillout' 'electronica' 'jazz'
 'indie pop' 'post-rock' 'britpop' 'chill' 'dark ambient' 'dream pop'
 'new age' 'world music' 'dreamy' 'relaxing' 'dubstep' '90210' 'post rock'
 'grunge' 'dark' 'my indie friends' 'indie luv' 'barkbarkdisco'
 'acoustic rock' 'indietronica' 'rock me babe' 'alt-country' 'space rock'
 'noise pop' 'dreampop' 'nu jazz' 'contemporary classical' 'acid jazz'
 'noise rock' 'indie folk' 'neofolk' 'synth-pop' 'anti-folk' 'smooth jazz'
 'jazz fusion' 'just chillin' 'math rock' 'light trip-hop'
 'in the witch house family' 'contemporary jazz'
 'berep guest dj closedmouth 239645' 'rockville ca'
 'in the chillwave family' 'dark disco' 'twilight' 'witch house'
 'free jazz' 'experimental rock' 'music for my soul' 'chill out'
 'freak folk'
 'the fact that this is the name of a 