# Experimental Analysis of Static Community Detection Using JUST

We run the Louvain and Leiden algorithms ten times each and record the modularity, the number of communities, and the run time of every run, so that the two algorithms can be compared on both partition quality and speed.

## Importing a Weighted Edge List into NetworkX

NetworkX's `read_weighted_edgelist` function expects a plain text file with one edge per line in the form `<node1> <node2> <weight>` and no header row. Our data is a CSV file with a header, so we first load it with pandas and inspect it before building the graph.
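To illustrate the format that `read_weighted_edgelist` accepts, here is a minimal sketch that writes a toy DataFrame as space-separated lines without a header and reads it back. The toy values and the names `toy_df` and `toy_graph` are made up for illustration; the notebook itself does not take this route.

```python
import io

import networkx as nx
import pandas as pd

# Toy edge list (made-up values) with the same columns as the real CSV.
toy_df = pd.DataFrame({
    "source": ["p0", "p0", "p1"],
    "target": ["a1", "a2", "a1"],
    "weight": [0.81, 0.83, 0.42],
})

# Serialize as "<node1> <node2> <weight>" lines: space-separated, no header, no index.
edge_text = toy_df.to_csv(sep=" ", header=False, index=False)

# read_weighted_edgelist accepts a path or a binary file handle.
toy_graph = nx.read_weighted_edgelist(io.BytesIO(edge_text.encode("utf-8")))

print(toy_graph.number_of_edges())
print(toy_graph["p0"]["a1"]["weight"])
```

Keeping the data in a DataFrame, as the cells below do, makes it easier to validate the edges before any graph is built.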

In [1]:
import pandas as pd

edge_list_df = pd.read_csv('New Input/JUST_edge_list_with_similarity.csv')

print(edge_list_df)

       source  target    weight
0          p0      a1  0.810501
1          p0      a2  0.825994
2          p0      a3  0.846090
3          p0      a4  0.766287
4          p0      a5  0.839626
...       ...     ...       ...
51372  p10169   p8094  0.586715
51373  p10169   p7974  0.669212
51374  p10169   p5852  0.599799
51375  p10169  p10113  0.639972
51376  p10169  p10031  0.488926

[51377 rows x 3 columns]


In [2]:
negative_weights = edge_list_df[edge_list_df['weight'] < 0]
print(f"Number of edges with negative weights: {len(negative_weights)}")

Number of edges with negative weights: 21


Louvain's modularity optimization assumes non-negative edge weights, so we rescale the weights from the range [-1, 1] to [0, 1] using w' = (w + 1) / 2. After rescaling, 0 represents perfect dissimilarity, 0.5 represents orthogonality, and 1 represents perfect similarity.

In [3]:
edge_list_df['weight'] = (edge_list_df['weight'] + 1) / 2

print(edge_list_df)

       source  target    weight
0          p0      a1  0.905251
1          p0      a2  0.912997
2          p0      a3  0.923045
3          p0      a4  0.883144
4          p0      a5  0.919813
...       ...     ...       ...
51372  p10169   p8094  0.793358
51373  p10169   p7974  0.834606
51374  p10169   p5852  0.799899
51375  p10169  p10113  0.819986
51376  p10169  p10031  0.744463

[51377 rows x 3 columns]
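As a quick check of the rescaling, the sketch below applies the same formula to the three reference points of the original range. The helper name `rescale_similarity` is hypothetical and only wraps the expression used in the cell above.

```python
import pandas as pd


def rescale_similarity(weights: pd.Series) -> pd.Series:
    """Map similarity scores from [-1, 1] onto [0, 1] via (w + 1) / 2."""
    return (weights + 1) / 2


# Perfect dissimilarity, orthogonality, perfect similarity.
reference_points = pd.Series([-1.0, 0.0, 1.0])
rescaled = rescale_similarity(reference_points)

print(rescaled.tolist())  # [0.0, 0.5, 1.0]
```

The mapping is linear and increasing, so the ordering of edges by weight is preserved.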


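Before creating the graph, we check the edge list for self-loops, duplicate edges, and missing values. We want a simple undirected, weighted graph with at most one edge per node pair: NetworkX's `Graph` would silently keep only the last weight of a repeated pair, and a `MultiGraph` would keep the repeats as parallel edges. In an undirected graph, a reversed pair such as (A, B) and (B, A) also counts as a repeat.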

In [4]:
duplicate_edges = edge_list_df.duplicated(subset=['source', 'target'], keep=False)
print(f"Number of duplicate edges: {duplicate_edges.sum()}")

self_loops = edge_list_df[edge_list_df['source'] == edge_list_df['target']]
print(f"Number of self-loops: {len(self_loops)}")
print(self_loops)

print(edge_list_df.isnull().sum())

Number of duplicate edges: 0
Number of self-loops: 4
      source target  weight
2030    p661   p661     1.0
18335  p4680  p4680     1.0
25834  p6225  p6225     1.0
45560  p9359  p9359     1.0
source    0
target    0
weight    0
dtype: int64


In [5]:
print(f"Number of edges before dropping self-loops: {len(edge_list_df)}")

edge_list_df = edge_list_df[edge_list_df['source'] != edge_list_df['target']]

print(f"Number of edges after dropping self-loops: {len(edge_list_df)}")

Number of edges before dropping self-loops: 51377
Number of edges after dropping self-loops: 51373


In [6]:
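# Re-check for repeated (source, target) pairs after dropping self-loops.
# Note: this compares pairs in the same order only, so a reversed pair
# such as (B, A) for an existing (A, B) is not detected here.
duplicate_edges = edge_list_df.duplicated(subset=['source', 'target'], keep=False)

# Filter to get only the duplicate edges
parallel_edges_df = edge_list_df[duplicate_edges]

# Sort so that any repeated pairs appear next to each other
parallel_edges_sorted = parallel_edges_df.sort_values(by=['source', 'target'])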

print(parallel_edges_sorted)

Empty DataFrame
Columns: [source, target, weight]
Index: []
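`DataFrame.duplicated` compares `source` and `target` in column order, so it does not flag an edge stored once as (A, B) and once as (B, A). A minimal sketch of an order-independent check follows, shown on a made-up toy edge list; the names `toy_edges`, `pair_keys`, and `reversed_or_repeated` are hypothetical, and the same idea could be applied to `edge_list_df`.

```python
import pandas as pd

# Toy edge list (made-up values): rows 0 and 1 are the same undirected edge.
toy_edges = pd.DataFrame({
    "source": ["a", "b", "a", "c"],
    "target": ["b", "a", "c", "d"],
    "weight": [0.9, 0.9, 0.5, 0.7],
})

# Build an order-independent key for every edge: ("a", "b") for both a-b and b-a.
pair_keys = pd.Series(
    [tuple(sorted(pair)) for pair in zip(toy_edges["source"], toy_edges["target"])],
    index=toy_edges.index,
)

# Flag every row whose undirected key occurs more than once.
reversed_or_repeated = toy_edges[pair_keys.duplicated(keep=False)]

print(reversed_or_repeated)
```

An empty result from this check would confirm that each undirected pair appears only once.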


## Creating Undirected Weighted Graph

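We iterate over the rows of the edge list DataFrame and add each edge, with its weight, to a new undirected NetworkX graph. The graph is created as a `MultiGraph`, which would keep any parallel edges rather than merge them.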


In [7]:
import networkx as nx

def get_graph_info(graph):
    print("Number of nodes:", graph.number_of_nodes())
    print("Number of edges:", graph.number_of_edges())
    
    # Checking the graph type to provide appropriate information
    if isinstance(graph, nx.DiGraph):
        print("Graph is Directed")
    else:
        print("Graph is Undirected")


In [8]:
# Initialize a new undirected multigraph
G = nx.MultiGraph()

# Add edges and weights
for index, row in edge_list_df.iterrows():
    source = row['source']
    target = row['target']
    weight = row['weight']
    
    # Add the edge with weight
    G.add_edge(source, target, weight=weight)

In [9]:
get_graph_info(G)

Number of nodes: 15649
Number of edges: 51373
Graph is Undirected
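As an alternative to the row-by-row loop, a graph can be built directly from the DataFrame with `nx.from_pandas_edgelist`. The sketch below uses a made-up toy edge list and builds both a simple `Graph` and a `MultiGraph`, to show how the two store edge weights differently; the names `toy_edges`, `simple_graph`, and `multi_graph` are hypothetical.

```python
import networkx as nx
import pandas as pd

# Toy edge list (made-up values) with the same columns as edge_list_df.
toy_edges = pd.DataFrame({
    "source": ["p0", "p0", "p1"],
    "target": ["a1", "a2", "a1"],
    "weight": [0.9, 0.8, 0.7],
})

# Simple undirected graph: at most one edge per node pair.
simple_graph = nx.from_pandas_edgelist(
    toy_edges, source="source", target="target",
    edge_attr="weight", create_using=nx.Graph,
)

# Undirected multigraph: parallel edges are kept under integer keys.
multi_graph = nx.from_pandas_edgelist(
    toy_edges, source="source", target="target",
    edge_attr="weight", create_using=nx.MultiGraph,
)

# In a Graph, the attributes sit directly under graph[u][v].
print(simple_graph["p0"]["a1"]["weight"])

# In a MultiGraph, there is one extra level: graph[u][v][key].
print(multi_graph["p0"]["a1"][0]["weight"])
```

Code that reads `graph[u][v]['weight']` directly will not find the weight on a `MultiGraph`, so it is worth checking how a given community detection implementation accesses weights before passing it a multigraph.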


## Running Louvain and Leiden Using CDlib

CDlib (Community Discovery Library) is designed for community detection and analysis, providing easy access to various algorithms, including Louvain and Leiden, and tools for evaluating and visualizing the results.
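The experiment scores each partition by Newman-Girvan modularity, which compares the fraction of edges inside communities with the fraction expected if edges were placed at random with the same degrees. To give a feel for the scale of that number, here is a minimal sketch on a toy graph using NetworkX's own `modularity` function rather than CDlib; the toy graph and partition are made up, and CDlib's implementation may differ in details such as weight handling.

```python
import networkx as nx
from networkx.algorithms.community import modularity

# Toy graph: two triangles joined by a single bridge edge (2, 3).
toy_graph = nx.Graph()
toy_graph.add_edges_from([
    (0, 1), (1, 2), (0, 2),   # first triangle
    (3, 4), (4, 5), (3, 5),   # second triangle
    (2, 3),                   # bridge
])

# Partition that matches the two triangles.
two_triangles = [{0, 1, 2}, {3, 4, 5}]
score_split = modularity(toy_graph, two_triangles)

# Trivial partition with every node in one community.
score_single = modularity(toy_graph, [set(toy_graph.nodes)])

print(f"Two communities: {score_split:.4f}")
print(f"One community:   {score_single:.4f}")
```

With 7 edges, each triangle holds 3 internal edges and a total degree of 7, so the split scores 2 × (3/7 − (7/14)²) = 5/14 ≈ 0.357, while the single-community partition scores 0.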

In [10]:
from cdlib import algorithms
import time
import numpy as np

# Prepare the data structure for results
results = {
    "Louvain": {"modularity": [], "communities": [], "time": []},
    "Leiden": {"modularity": [], "communities": [], "time": []}
}

# Execute each algorithm 10 times
for _ in range(10):
    # Louvain
    start_time = time.time()
    communities_louvain = algorithms.louvain(G, weight='weight')
    elapsed_time = time.time() - start_time
    results["Louvain"]["modularity"].append(communities_louvain.newman_girvan_modularity().score)
    results["Louvain"]["communities"].append(len(communities_louvain.communities))
    results["Louvain"]["time"].append(elapsed_time)
    
    # Leiden
    start_time = time.time()
    communities_leiden = algorithms.leiden(G, weights='weight')
    elapsed_time = time.time() - start_time
    results["Leiden"]["modularity"].append(communities_leiden.newman_girvan_modularity().score)
    results["Leiden"]["communities"].append(len(communities_leiden.communities))
    results["Leiden"]["time"].append(elapsed_time)

# Calculate averages
for method in results:
    results[method]["avg_modularity"] = np.mean(results[method]["modularity"])
    results[method]["avg_time"] = np.mean(results[method]["time"])
    results[method]["avg_communities"] = np.mean(results[method]["communities"])




In [11]:
# Display the results
print(results)

{'Louvain': {'modularity': [0.5957159635089166, 0.5805000889617737, 0.6463605882871974, 0.588060428829819, 0.616711752889861, 0.6010318802797675, 0.5968676812202742, 0.5854100176117693, 0.6035033660424804, 0.6224959471735716], 'communities': [1820, 1898, 1581, 1819, 1689, 1785, 1831, 1867, 1739, 1663], 'time': [11.768126964569092, 10.624771118164062, 5.055445909500122, 5.814736843109131, 5.640079975128174, 5.766633987426758, 7.036201238632202, 9.15189504623413, 6.269519090652466, 8.56113314628601], 'avg_modularity': 0.6036657714805431, 'avg_time': 7.568854331970215, 'avg_communities': 1769.2}, 'Leiden': {'modularity': [0.7428975802369394, 0.7310618570501866, 0.734277640290517, 0.7413171119681695, 0.7365376343645264, 0.7350904352275499, 0.7369605110560752, 0.737211348320968, 0.7390722449437401, 0.7400887605465057], 'communities': [29, 29, 27, 31, 26, 28, 28, 27, 26, 26], 'time': [1.3771846294403076, 1.2270317077636719, 1.4036669731140137, 1.3829102516174316, 1.4404339790344238, 1.259916

In [12]:
# Convert the results dictionary into a pandas DataFrame
# First, prepare the data in a structured form
data = {
    "Algorithm": [],
    "Run": [],
    "Modularity": [],
    "Communities": [],
    "Time (s)": []
}

# Populate the structured data from the results
for algo in results:
    for run in range(len(results[algo]["modularity"])):  # One entry per recorded run
        data["Algorithm"].append(algo)
        data["Run"].append(run + 1)  # Run number (1-10)
        data["Modularity"].append(results[algo]["modularity"][run])
        data["Communities"].append(results[algo]["communities"][run])
        data["Time (s)"].append(results[algo]["time"][run])

# Creating the DataFrame
results_df = pd.DataFrame(data)

# Display the DataFrame for visual inspection
print(results_df)

# Additionally, creating a summary DataFrame for averages
summary_data = {
    "Algorithm": ["Louvain", "Leiden"],
    "Avg. Modularity": [results["Louvain"]["avg_modularity"], results["Leiden"]["avg_modularity"]],
    "Avg. Communities": [results["Louvain"]["avg_communities"], results["Leiden"]["avg_communities"]],
    "Avg. Time (s)": [results["Louvain"]["avg_time"], results["Leiden"]["avg_time"]]
}

summary_df = pd.DataFrame(summary_data)

# Display the summary DataFrame
print(summary_df)

   Algorithm  Run  Modularity  Communities   Time (s)
0    Louvain    1    0.595716         1820  11.768127
1    Louvain    2    0.580500         1898  10.624771
2    Louvain    3    0.646361         1581   5.055446
3    Louvain    4    0.588060         1819   5.814737
4    Louvain    5    0.616712         1689   5.640080
5    Louvain    6    0.601032         1785   5.766634
6    Louvain    7    0.596868         1831   7.036201
7    Louvain    8    0.585410         1867   9.151895
8    Louvain    9    0.603503         1739   6.269519
9    Louvain   10    0.622496         1663   8.561133
10    Leiden    1    0.742898           29   1.377185
11    Leiden    2    0.731062           29   1.227032
12    Leiden    3    0.734278           27   1.403667
13    Leiden    4    0.741317           31   1.382910
14    Leiden    5    0.736538           26   1.440434
15    Leiden    6    0.735090           28   1.259917
16    Leiden    7    0.736961           28   1.403141
17    Leiden    8    0.73721
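The per-algorithm averages can also be derived directly from the long-format DataFrame with a `groupby`, which makes it easy to add a measure of spread such as the standard deviation. This is a minimal sketch on made-up toy values with the same column names as `results_df`; the names `toy_results` and `summary` are hypothetical.

```python
import pandas as pd

# Toy results (made-up values) in the same layout as results_df.
toy_results = pd.DataFrame({
    "Algorithm": ["Louvain", "Louvain", "Leiden", "Leiden"],
    "Run": [1, 2, 1, 2],
    "Modularity": [0.5, 0.7, 0.8, 0.9],
    "Communities": [10, 14, 3, 5],
    "Time (s)": [2.0, 4.0, 1.0, 1.0],
})

# Mean and sample standard deviation of each metric, per algorithm.
metrics = ["Modularity", "Communities", "Time (s)"]
summary = toy_results.groupby("Algorithm")[metrics].agg(["mean", "std"])

print(summary)
```

The result has two-level columns, so a single value is read with a tuple key, for example `summary.loc["Louvain", ("Modularity", "mean")]`.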

In [13]:
# Save the DataFrame to a CSV file
csv_filename = 'JUST_CD_Experiment_Values.csv'
results_df.to_csv(csv_filename, index=False)