# Experimental Analysis of Static Community Detection Using JUST

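To compare the Louvain and Leiden algorithms on the JUST graph, we run each algorithm several times and record the modularity, the number of communities, and the execution time of every run. Both algorithms are randomized, so repeated runs show how much the results vary from one run to the next.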

## Importing the Weighted Edge List into NetworkX

NetworkX's read_weighted_edgelist function expects a plain text file with one edge per line in the form <node1> <node2> <weight> and no header row. Since our data is a CSV file with a header, we load it with Pandas first and build the graph from the resulting DataFrame.

In [1]:
import pandas as pd

edge_list_df = pd.read_csv('New Input/JUST_edge_list_with_similarity.csv')

print(edge_list_df)

        source  target    weight
0          u_2    a_51  0.999056
1          u_2    a_52  0.944903
2          u_2    a_53  0.992538
3          u_2    a_54  0.692012
4          u_2    a_55  0.999549
...        ...     ...       ...
160877  u_2099  u_1801  0.996204
160878  u_2099  u_2006  0.996046
160879  u_2099  u_2016  0.996104
160880  u_2100   u_586  0.998411
160881  u_2100   u_607  0.998568

[160882 rows x 3 columns]
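
For comparison, here is a minimal sketch of the read_weighted_edgelist route mentioned above. It assumes a small toy DataFrame (the first two rows of the table) and a temporary file name, `edges.txt`, chosen only for illustration. The DataFrame is written out space-separated and without a header so that NetworkX can parse it directly.

```python
import os
import tempfile

import networkx as nx
import pandas as pd

# Toy stand-in for the real edge list (first two rows of the table above).
toy_df = pd.DataFrame(
    {
        "source": ["u_2", "u_2"],
        "target": ["a_51", "a_52"],
        "weight": [0.999056, 0.944903],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name; any path works.
    path = os.path.join(tmp_dir, "edges.txt")
    # One "<node1> <node2> <weight>" triple per line, no header, no index.
    toy_df.to_csv(path, sep=" ", header=False, index=False)
    G_toy = nx.read_weighted_edgelist(path)

print(G_toy.number_of_edges())
print(G_toy["u_2"]["a_51"]["weight"])
```

This route only works if node labels contain no whitespace, which holds for labels such as u_2 and a_51.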


In [2]:
negative_weights = edge_list_df[edge_list_df['weight'] < 0]
print(f"Number of edges with negative weights: {len(negative_weights)}")

Number of edges with negative weights: 17409


Louvain's modularity optimization is not designed for negative edge weights, so we rescale the weights linearly from the range [-1, 1] to the range [0, 1] using w' = (w + 1) / 2. After rescaling, 0 represents perfect dissimilarity, 0.5 represents orthogonality, and 1 represents perfect similarity.

In [3]:
edge_list_df['weight'] = (edge_list_df['weight'] + 1) / 2

print(edge_list_df)

        source  target    weight
0          u_2    a_51  0.999528
1          u_2    a_52  0.972451
2          u_2    a_53  0.996269
3          u_2    a_54  0.846006
4          u_2    a_55  0.999775
...        ...     ...       ...
160877  u_2099  u_1801  0.998102
160878  u_2099  u_2006  0.998023
160879  u_2099  u_2016  0.998052
160880  u_2100   u_586  0.999206
160881  u_2100   u_607  0.999284

[160882 rows x 3 columns]
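
As a quick sanity check of the rescaling, the sketch below applies the same formula to a few hand-picked values. The helper name `rescale_similarity` is introduced here for illustration only; the value 0.692012 is taken from row 3 of the original table.

```python
import pandas as pd


def rescale_similarity(weights):
    """Map similarities linearly from [-1, 1] to [0, 1]."""
    return (weights + 1) / 2


# Endpoints, the midpoint, and one real weight from the edge list.
example = pd.Series([-1.0, 0.0, 0.692012, 1.0])
rescaled = rescale_similarity(example)

print(rescaled.tolist())
```

The mapping is strictly increasing, so the relative ordering of edge weights is preserved; only their scale and offset change.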


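Before creating the graph, we check the edge list for duplicate edges, self-loops, and missing values. A simple undirected NetworkX graph stores at most one edge per node pair, so a repeated pair, including a reversed pair such as (A, B) and (B, A), would silently overwrite the weight of the earlier edge. Note that the duplicate check used here compares (source, target) in the order given, so it does not detect reversed pairs.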

In [4]:
duplicate_edges = edge_list_df.duplicated(subset=['source', 'target'], keep=False)
print(f"Number of duplicate edges: {duplicate_edges.sum()}")

self_loops = edge_list_df[edge_list_df['source'] == edge_list_df['target']]
print(f"Number of self-loops: {len(self_loops)}")

print(edge_list_df.isnull().sum())

Number of duplicate edges: 0
Number of self-loops: 0
source    0
target    0
weight    0
dtype: int64


In [5]:
# Find duplicate edges (ignoring the weight column)
duplicate_edges = edge_list_df.duplicated(subset=['source', 'target'], keep=False)

# Filter to get only the duplicate edges
parallel_edges_df = edge_list_df[duplicate_edges]

# Sort to better visualize parallel edges
parallel_edges_sorted = parallel_edges_df.sort_values(by=['source', 'target'])

print(parallel_edges_sorted)

Empty DataFrame
Columns: [source, target, weight]
Index: []
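
The checks above compare (source, target) in the order given. In an undirected graph, (A, B) and (B, A) describe the same edge, so a stricter check would put each pair into a canonical order first. The following is a minimal sketch of that idea on toy data; the helper name `count_undirected_duplicates` is hypothetical, and the sketch has not been run against the real edge list.

```python
import numpy as np
import pandas as pd


def count_undirected_duplicates(df):
    """Count rows whose unordered {source, target} pair occurs more than once."""
    # Sort each row's two endpoints so (A, B) and (B, A) become identical.
    pairs = np.sort(df[["source", "target"]].to_numpy(dtype=str), axis=1)
    canonical = pd.DataFrame(pairs, columns=["u", "v"])
    return int(canonical.duplicated(keep=False).sum())


toy_df = pd.DataFrame(
    {
        "source": ["a", "b", "a"],
        "target": ["b", "a", "c"],
        "weight": [0.9, 0.8, 0.7],
    }
)

# The order-sensitive check misses the reversed pair ...
print(toy_df.duplicated(subset=["source", "target"], keep=False).sum())
# ... while the canonicalized check flags both rows of the pair.
print(count_undirected_duplicates(toy_df))
```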


## Creating Undirected Weighted NX Graph

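We iterate over the rows of the edge list DataFrame and add each edge, together with its weight, to a new NetworkX graph. The graph is instantiated as an nx.MultiGraph, which is undirected but stores repeated node pairs as separate parallel edges instead of merging them.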


In [6]:
import networkx as nx

def get_graph_info(graph):
    print("Number of nodes:", graph.number_of_nodes())
    print("Number of edges:", graph.number_of_edges())
    
    # Checking the graph type to provide appropriate information
    if isinstance(graph, nx.DiGraph):
        print("Graph is Directed")
    else:
        print("Graph is Undirected")


In [7]:
# Initialize a new graph
G = nx.MultiGraph()

# Add edges and weights
for index, row in edge_list_df.iterrows():
    source = row['source']
    target = row['target']
    weight = row['weight']
    
    # Add the edge with weight
    G.add_edge(source, target, weight=weight)

In [8]:
get_graph_info(G)

Number of nodes: 21518
Number of edges: 160882
Graph is Undirected
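
As an alternative to the row-by-row loop, NetworkX can build the graph directly from a DataFrame with from_pandas_edgelist. The sketch below uses toy data, not the JUST edge list, and also illustrates how the choice of graph class affects the edge count when a node pair is repeated: nx.Graph keeps one edge per pair, while nx.MultiGraph keeps every row as its own edge.

```python
import networkx as nx
import pandas as pd

# Toy edge list with one reversed duplicate: (a, b) and (b, a).
toy_df = pd.DataFrame(
    {
        "source": ["a", "b", "a"],
        "target": ["b", "a", "c"],
        "weight": [0.9, 0.8, 0.7],
    }
)

# Simple undirected graph: the repeated pair collapses into one edge.
G_simple = nx.from_pandas_edgelist(
    toy_df, source="source", target="target",
    edge_attr="weight", create_using=nx.Graph,
)

# Multigraph: every row becomes its own (possibly parallel) edge.
G_multi = nx.from_pandas_edgelist(
    toy_df, source="source", target="target",
    edge_attr="weight", create_using=nx.MultiGraph,
)

print(G_simple.number_of_edges(), G_multi.number_of_edges())
```

If the two edge counts differ on a real edge list, the data contains node pairs that a simple undirected graph would merge.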


## Testing Modularity, Run Time, and Number of Communities

- Iterates 10 times, running both the Louvain and the Leiden algorithm from CDLib on each iteration.
- Records modularity, number of communities, and execution time for each run.
- Calculates the average modularity, average number of communities, and average execution time for both algorithms across all runs.
- Stores all this information in the results dictionary for easy access and analysis.
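
To make the modularity score concrete, here is a small worked example on a toy graph: two triangles joined by a single bridge edge. It uses NetworkX's own modularity function as a stand-in, not the CDLib call used in the experiment, so treat it as an illustration of the measure rather than a reproduction of the scores reported here.

```python
import networkx as nx
from networkx.algorithms.community import modularity

# Two triangles (0-1-2 and 3-4-5) connected by the bridge edge (2, 3).
G_toy = nx.Graph()
G_toy.add_edges_from(
    [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
)

# Natural split: one community per triangle.
two_communities = [{0, 1, 2}, {3, 4, 5}]
# Trivial split: everything in one community.
one_community = [{0, 1, 2, 3, 4, 5}]

q_two = modularity(G_toy, two_communities)
q_one = modularity(G_toy, one_community)

print(q_two, q_one)
```

With m = 7 edges, each triangle contains 3 internal edges and has total degree 7, so the two-community split scores 2 × (3/7 − (7/14)²) = 5/14 ≈ 0.357, while the single-community split scores 0. Higher values indicate that more edge weight falls inside communities than would be expected at random.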

In [9]:
from cdlib import algorithms
import time
import numpy as np

# Prepare the data structure for results
results = {
    "Louvain": {"modularity": [], "communities": [], "time": []},
    "Leiden": {"modularity": [], "communities": [], "time": []}
}

# Execute each algorithm 10 times.
# Only the community detection call is timed; the modularity
# computation happens outside the timed region.
for _ in range(10):
    # Louvain
    start_time = time.perf_counter()
    communities_louvain = algorithms.louvain(G, weight='weight')
    elapsed_time = time.perf_counter() - start_time
    results["Louvain"]["modularity"].append(communities_louvain.newman_girvan_modularity().score)
    results["Louvain"]["communities"].append(len(communities_louvain.communities))
    results["Louvain"]["time"].append(elapsed_time)
    
    # Leiden (note: the keyword is `weights`, not `weight`)
    start_time = time.perf_counter()
    communities_leiden = algorithms.leiden(G, weights='weight')
    elapsed_time = time.perf_counter() - start_time
    results["Leiden"]["modularity"].append(communities_leiden.newman_girvan_modularity().score)
    results["Leiden"]["communities"].append(len(communities_leiden.communities))
    results["Leiden"]["time"].append(elapsed_time)

# Calculate averages
for method in results:
    results[method]["avg_modularity"] = np.mean(results[method]["modularity"])
    results[method]["avg_time"] = np.mean(results[method]["time"])
    results[method]["avg_communities"] = np.mean(results[method]["communities"])




In [10]:
# Display the results
print(results)

{'Louvain': {'modularity': [0.48787245174148963, 0.4875821610607214, 0.4798222465884821, 0.4849485185003455, 0.4826929309229858, 0.4623715573684709, 0.5085587918735801, 0.47969852009227737, 0.4917758562218015, 0.4767517403654994], 'communities': [561, 533, 594, 575, 549, 639, 15, 571, 562, 625], 'time': [10.978015899658203, 13.526695966720581, 13.75668478012085, 10.96618914604187, 12.305590867996216, 13.670734882354736, 11.65299916267395, 10.841839075088501, 8.893937826156616, 12.886661291122437], 'avg_modularity': 0.48420747747356535, 'avg_time': 11.947934889793396, 'avg_communities': 522.4}, 'Leiden': {'modularity': [0.5141272181100943, 0.5171112572122496, 0.5177108822604298, 0.5113865530975705, 0.5171342734585987, 0.5135731165599291, 0.5136714602877601, 0.5163515187791415, 0.5132695249597226, 0.517785317101375], 'communities': [15, 16, 12, 14, 14, 14, 19, 15, 16, 14], 'time': [3.9380009174346924, 3.358065128326416, 2.926163911819458, 3.0199339389801025, 3.1228387355804443, 2.9621839

In [11]:
# Convert the results dictionary into a pandas DataFrame
# First, prepare the data in a structured form
data = {
    "Algorithm": [],
    "Run": [],
    "Modularity": [],
    "Communities": [],
    "Time (s)": []
}

# Populate the structured data from the results,
# deriving the number of runs from the recorded lists
for algo in results:
    runs = zip(
        results[algo]["modularity"],
        results[algo]["communities"],
        results[algo]["time"],
    )
    for run, (modularity, n_communities, elapsed) in enumerate(runs, start=1):
        data["Algorithm"].append(algo)
        data["Run"].append(run)  # Run number (1-10)
        data["Modularity"].append(modularity)
        data["Communities"].append(n_communities)
        data["Time (s)"].append(elapsed)

# Creating the DataFrame
results_df = pd.DataFrame(data)

# Display the DataFrame for visual inspection
print(results_df)

# Additionally, creating a summary DataFrame for averages
summary_data = {
    "Algorithm": ["Louvain", "Leiden"],
    "Avg. Modularity": [results["Louvain"]["avg_modularity"], results["Leiden"]["avg_modularity"]],
    "Avg. Communities": [results["Louvain"]["avg_communities"], results["Leiden"]["avg_communities"]],
    "Avg. Time (s)": [results["Louvain"]["avg_time"], results["Leiden"]["avg_time"]]
}

summary_df = pd.DataFrame(summary_data)

# Display the summary DataFrame
print(summary_df)

   Algorithm  Run  Modularity  Communities   Time (s)
0    Louvain    1    0.487872          561  10.978016
1    Louvain    2    0.487582          533  13.526696
2    Louvain    3    0.479822          594  13.756685
3    Louvain    4    0.484949          575  10.966189
4    Louvain    5    0.482693          549  12.305591
5    Louvain    6    0.462372          639  13.670735
6    Louvain    7    0.508559           15  11.652999
7    Louvain    8    0.479699          571  10.841839
8    Louvain    9    0.491776          562   8.893938
9    Louvain   10    0.476752          625  12.886661
10    Leiden    1    0.514127           15   3.938001
11    Leiden    2    0.517111           16   3.358065
12    Leiden    3    0.517711           12   2.926164
13    Leiden    4    0.511387           14   3.019934
14    Leiden    5    0.517134           14   3.122839
15    Leiden    6    0.513573           14   2.962184
16    Leiden    7    0.513671           19   2.931429
17    Leiden    8    0.51635
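
The per-algorithm summary can also be derived directly from the long-format table with a group-by, which makes it easy to report a median and a standard deviation next to the mean. The sketch below uses only the first two runs of each algorithm from the table above as toy input; the name `toy_results` is introduced here for illustration.

```python
import pandas as pd

# First two runs per algorithm, copied from the results table.
toy_results = pd.DataFrame(
    {
        "Algorithm": ["Louvain", "Louvain", "Leiden", "Leiden"],
        "Run": [1, 2, 1, 2],
        "Modularity": [0.487872, 0.487582, 0.514127, 0.517111],
        "Communities": [561, 533, 15, 16],
        "Time (s)": [10.978016, 13.526696, 3.938001, 3.358065],
    }
)

# One row per algorithm, one (metric, statistic) column pair per entry.
summary = (
    toy_results
    .groupby("Algorithm")[["Modularity", "Communities", "Time (s)"]]
    .agg(["mean", "median", "std"])
)

print(summary)
```

The median may be the more informative statistic for the number of communities, since a single atypical run (such as Louvain run 7 with 15 communities) shifts the mean noticeably.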

In [12]:
# Save the DataFrame to a CSV file
csv_filename = 'JUST_CD_Experiment_Values.csv'
results_df.to_csv(csv_filename, index=False)