# Experimental Analysis of Static Community Detection Using M2V

We run the Louvain and Leiden algorithms several times each and record modularity, number of communities, and execution time for every run, so that the two algorithms can be compared on the same graph.

## Importing Edge List w/ Weights to NetworkX

NetworkX's `read_weighted_edgelist` function expects a plain text file with lines of the form `<node1> <node2> <weight>` and no header row. Since our data is a CSV file with a header, we load it with Pandas first and build the graph from the resulting DataFrame.
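If you would rather call `read_weighted_edgelist` directly, one option is to rewrite the CSV as a headerless, space-separated file first. The sketch below does this on a tiny made-up edge list; the file name `toy_edges.txt` and the toy values are illustrative assumptions, not part of the dataset.

```python
import os
import tempfile

import networkx as nx
import pandas as pd

# Toy stand-in for the real CSV (hypothetical values)
toy_df = pd.DataFrame(
    {"source": ["u1", "u2"], "target": ["t1", "t1"], "weight": [0.5, 0.25]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_edges.txt")
    # Drop the header and index, and use a space as the delimiter
    toy_df.to_csv(path, sep=" ", header=False, index=False)
    # Each line is parsed as: <node1> <node2> <weight>
    toy_graph = nx.read_weighted_edgelist(path)

print(toy_graph["u1"]["t1"]["weight"])
```

For a file of this size the intermediate text file adds an extra pass over the data, which is one reason to build the graph from the DataFrame instead.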

In [1]:
import pandas as pd

edge_list_df = pd.read_csv('Input/M2V_edge_list_with_similarity.csv')

print(edge_list_df)

            source    target    weight
0        u74717431  t7748381  0.892881
1       u127821914  t3529910  0.717761
2       u174194590  t5762915  0.483259
3       u141847381  t6987845  0.720264
4        u87215499  t4082536  0.754541
...            ...       ...       ...
641138     ci20717       co3  0.772626
641139     ci20718       co9  0.878173
641140     ci20719       co5  0.737660
641141     ci20720       co3  0.751880
641142     ci20721       co4  0.766678

[641143 rows x 3 columns]


In [2]:
negative_weights = edge_list_df[edge_list_df['weight'] < 0]
print(f"Number of edges with negative weights: {len(negative_weights)}")

Number of edges with negative weights: 0


Since Louvain is not designed to handle negative edge weights, we rescale the weights from the range [-1, 1] to [0, 1], so that 0 represents perfect dissimilarity, 0.5 represents orthogonality, and 1 represents perfect similarity. The check above found no negative weights in this dataset, so every rescaled weight is at least 0.5.

In [3]:
edge_list_df['weight'] = (edge_list_df['weight'] + 1) / 2

print(edge_list_df)

            source    target    weight
0        u74717431  t7748381  0.946440
1       u127821914  t3529910  0.858880
2       u174194590  t5762915  0.741629
3       u141847381  t6987845  0.860132
4        u87215499  t4082536  0.877271
...            ...       ...       ...
641138     ci20717       co3  0.886313
641139     ci20718       co9  0.939086
641140     ci20719       co5  0.868830
641141     ci20720       co3  0.875940
641142     ci20721       co4  0.883339

[641143 rows x 3 columns]
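As a quick sanity check of the rescaling formula, the sketch below applies it to the three reference points named above and to the first weight in the table. The helper name `rescale_similarity` is hypothetical and only used for illustration.

```python
import pandas as pd


def rescale_similarity(weights: pd.Series) -> pd.Series:
    """Map similarities from the range [-1, 1] to [0, 1]."""
    return (weights + 1) / 2


# Perfect dissimilarity, orthogonality, perfect similarity
reference = pd.Series([-1.0, 0.0, 1.0])
rescaled = rescale_similarity(reference)
print(rescaled.tolist())  # [0.0, 0.5, 1.0]

# First row of the edge list: 0.892881 before rescaling
first_row = rescale_similarity(pd.Series([0.892881]))
print(first_row.iloc[0])
```

The mapping is linear and increasing, so it preserves the ordering of edge weights.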


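Before creating the graph, we check the edge list for duplicate edges, self-loops, and missing values. In an undirected graph the pairs (A, B) and (B, A) describe the same edge, and a simple NetworkX `Graph` stores only one edge per node pair, so a repeated pair would overwrite the earlier weight. Note that `duplicated` on `['source', 'target']` only flags rows repeated in the same order; a reversed pair (B, A) would not be caught by it.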

In [4]:
duplicate_edges = edge_list_df.duplicated(subset=['source', 'target'], keep=False)
print(f"Number of duplicate edges: {duplicate_edges.sum()}")

self_loops = edge_list_df[edge_list_df['source'] == edge_list_df['target']]
print(f"Number of self-loops: {len(self_loops)}")

print(edge_list_df.isnull().sum())

Number of duplicate edges: 0
Number of self-loops: 0
source    0
target    0
weight    0
dtype: int64


In [5]:
# Find duplicate edges (ignoring the weight column)
duplicate_edges = edge_list_df.duplicated(subset=['source', 'target'], keep=False)

# Filter to get only the duplicate edges
parallel_edges_df = edge_list_df[duplicate_edges]

# Sort to better visualize parallel edges
parallel_edges_sorted = parallel_edges_df.sort_values(by=['source', 'target'])

print(parallel_edges_sorted)

Empty DataFrame
Columns: [source, target, weight]
Index: []
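The duplicate checks above compare `source` and `target` in the order they appear. To also catch reversed pairs in an undirected edge list, one approach is to build an order-independent key per row. The sketch below uses a toy DataFrame with made-up values; it is an illustration, not a check that was run on the dataset.

```python
import pandas as pd

# Toy edge list (hypothetical): rows 0 and 1 are the same undirected edge
toy_df = pd.DataFrame(
    {"source": ["a", "b", "c"], "target": ["b", "a", "d"], "weight": [0.9, 0.8, 0.7]}
)

# Same-order check, as used above: does not see the reversed pair
same_order = toy_df.duplicated(subset=["source", "target"], keep=False)

# Order-independent key: sort each endpoint pair so (b, a) becomes (a, b)
canonical = pd.Series(
    [tuple(sorted(pair)) for pair in zip(toy_df["source"], toy_df["target"])],
    index=toy_df.index,
)
either_order = canonical.duplicated(keep=False)

print(same_order.sum(), either_order.sum())
```

In this dataset the node prefixes (`u`/`t`, `ci`/`co`) suggest that sources and targets come from different node types, in which case reversed pairs may be unlikely, but that is an assumption worth verifying rather than relying on.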


We also check the weight column for non-numeric values, since these would not be valid edge weights in a graph object.

In [6]:
# Check for any non-numeric values in the 'weight' column
non_numeric_weights = edge_list_df[pd.to_numeric(edge_list_df['weight'], errors='coerce').isna()]

# Display rows with non-numeric or NaN weights
print(non_numeric_weights)

Empty DataFrame
Columns: [source, target, weight]
Index: []
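Because the real data produced an empty result, it can help to see what the same check reports when bad values are present. The sketch below runs it on a toy DataFrame with made-up entries.

```python
import pandas as pd

# Toy edge list (hypothetical): one valid weight, one string, one missing value
toy_df = pd.DataFrame(
    {
        "source": ["a", "b", "c"],
        "target": ["x", "y", "z"],
        "weight": ["0.7", "n/a", None],
    }
)

# errors='coerce' turns anything unparseable into NaN instead of raising
as_numbers = pd.to_numeric(toy_df["weight"], errors="coerce")
bad_rows = toy_df[as_numbers.isna()]

print(bad_rows.index.tolist())
```

Rows 1 and 2 are flagged, while the numeric string `"0.7"` parses cleanly.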


In [7]:
# Convert the 'weight' column to floating point values
edge_list_df['weight'] = edge_list_df['weight'].astype(float)

# Check the data type of the column to confirm the conversion
print(edge_list_df.dtypes)

source     object
target     object
weight    float64
dtype: object


## Creating Undirected Weighted NX Graph for Louvain

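We iterate over the rows of the edge list DataFrame and add each edge, together with its weight, to a new NetworkX graph. The graph is created as a `MultiGraph`, which would keep parallel edges if any were present.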


In [8]:
import networkx as nx

def get_graph_info(graph):
    print("Number of nodes:", graph.number_of_nodes())
    print("Number of edges:", graph.number_of_edges())
    
    # Checking the graph type to provide appropriate information
    if isinstance(graph, nx.DiGraph):
        print("Graph is Directed")
    else:
        print("Graph is Undirected")


In [9]:
# Initialize a new undirected multigraph
G = nx.MultiGraph()

# Add each edge with its weight; itertuples avoids building a Series per row
for source, target, weight in edge_list_df[['source', 'target', 'weight']].itertuples(index=False, name=None):
    G.add_edge(source, target, weight=weight)

In [10]:
get_graph_info(G)

Number of nodes: 245621
Number of edges: 641143
Graph is Undirected
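NetworkX can also build a graph straight from a DataFrame with `from_pandas_edgelist`, which avoids the explicit loop. The sketch below shows this on a toy edge list with made-up values. It creates a simple `Graph` rather than the `MultiGraph` used above, so treat it as an alternative under that assumption, not a drop-in replacement.

```python
import networkx as nx
import pandas as pd

# Toy edge list (hypothetical values)
toy_df = pd.DataFrame(
    {
        "source": ["u1", "u2", "u3"],
        "target": ["t1", "t1", "t2"],
        "weight": [0.9, 0.8, 0.7],
    }
)

# Column names are passed explicitly; 'weight' becomes an edge attribute
toy_graph = nx.from_pandas_edgelist(
    toy_df,
    source="source",
    target="target",
    edge_attr="weight",
    create_using=nx.Graph,
)

print(toy_graph.number_of_nodes(), toy_graph.number_of_edges())
print(toy_graph["u1"]["t1"]["weight"])
```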


## Creating Undirected Weighted iGraph for Leiden

Because running Leiden on the MusicMicro dataset was problematic, we decided to isolate the issue and use iGraph directly, as suggested by Leiden's authors.

We iterate over the rows of the edge list DataFrame to build a list of (source, target, weight) tuples, which is then used to construct a new iGraph graph.

In [11]:
import igraph as ig

# Build (source, target, weight) tuples from the DataFrame
edges_with_weights = list(
    edge_list_df[['source', 'target', 'weight']].itertuples(index=False, name=None)
)

# Create the igraph Graph; the third item of each tuple is stored as the 'weight' edge attribute
g = ig.Graph.TupleList(edges_with_weights, edge_attrs=['weight'])

# Now, check again if the weights have been correctly assigned
print(g.es['weight'][:5])
print(g.summary())

[0.9464404929422645, 0.8588802696312696, 0.7416292808281564, 0.8601320326015599, 0.8772707346584033]
IGRAPH UNW- 245621 641143 -- 
+ attr: name (v), weight (e)


## Testing Modularity, Run Time and No. of Communities

- Runs both Louvain (via CDLib, on the NetworkX graph) and Leiden (via leidenalg, on the iGraph graph) 10 times.
- Records modularity, number of communities, and execution time for each run.
- Calculates the average modularity, average number of communities, and average execution time for each algorithm across all runs.
- Stores all of this in the `results` dictionary for easy access and analysis.
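To make the modularity metric concrete, the sketch below computes it by hand-picked partitions on a toy graph of two triangles joined by one bridge edge. It uses NetworkX's own `modularity` function rather than the CDLib and leidenalg scores used in the experiment, so the numbers only illustrate the idea.

```python
import networkx as nx
from networkx.algorithms.community import modularity

# Toy graph: two triangles (0-1-2 and 3-4-5) joined by the edge 2-3
toy_graph = nx.Graph()
toy_graph.add_edges_from(
    [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
)

# Partition that follows the two triangles
triangle_split = [{0, 1, 2}, {3, 4, 5}]
# Partition that puts every node in one community
single_block = [{0, 1, 2, 3, 4, 5}]

q_split = modularity(toy_graph, triangle_split)
q_single = modularity(toy_graph, single_block)

print(round(q_split, 4))   # 0.3571, i.e. 5/14
print(round(q_single, 4))  # 0.0
```

The triangle partition scores higher because most edges fall inside communities, while a single all-inclusive community scores zero by construction.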

In [12]:
import time
import numpy as np
from cdlib import algorithms
import leidenalg

# Prepare lists to store results
results = {
    "Louvain": {"modularity": [], "communities": [], "time": []},
    "Leiden": {"modularity": [], "communities": [], "time": []}
}

# Execute each algorithm 10 times
for _ in range(10):
    # Louvain
    start_time = time.time()
    communities_louvain = algorithms.louvain(G)
    elapsed_time = time.time() - start_time
    results["Louvain"]["modularity"].append(communities_louvain.newman_girvan_modularity().score)
    results["Louvain"]["communities"].append(len(communities_louvain.communities))
    results["Louvain"]["time"].append(elapsed_time)
    
    # Leiden
    start_time = time.time()
    partition = leidenalg.find_partition(g, partition_type=leidenalg.ModularityVertexPartition, weights='weight')
    elapsed_time = time.time() - start_time
    results["Leiden"]["modularity"].append(partition.modularity)
    results["Leiden"]["communities"].append(len(partition))
    results["Leiden"]["time"].append(elapsed_time)

# Calculate averages
for method in results:
    results[method]["avg_modularity"] = np.mean(results[method]["modularity"])
    results[method]["avg_time"] = np.mean(results[method]["time"])
    results[method]["avg_communities"] = np.mean(results[method]["communities"])

# Print or store the results as needed
print(results)

Note: to be able to use all crisp methods, you need to install some additional packages:  {'graph_tool', 'wurlitzer', 'infomap', 'bayanpy'}
Note: to be able to use all crisp methods, you need to install some additional packages:  {'ASLPAw', 'pyclustering'}
Note: to be able to use all crisp methods, you need to install some additional packages:  {'wurlitzer', 'infomap'}

In [13]:
print(results)

{'Louvain': {'modularity': [0.7107906293333608, 0.7146033550336707, 0.7100781811219058, 0.711286922818838, 0.7104126879038287, 0.7107213783426581, 0.7112670943582213, 0.7140453817597964, 0.7109466649870047, 0.7102647930672743], 'communities': [270, 407, 323, 317, 276, 314, 346, 131, 271, 378], 'time': [140.05474710464478, 185.6614499092102, 155.41062927246094, 143.72338771820068, 165.8080689907074, 114.40305089950562, 145.0077781677246, 146.19717693328857, 183.72993516921997, 167.2318778038025], 'avg_modularity': 0.7114417088726559, 'avg_time': 154.72281019687654, 'avg_communities': 303.3}, 'Leiden': {'modularity': [0.6061964131056568, 0.6077849209284348, 0.6039175958946954, 0.6085297553859685, 0.6031347141821898, 0.6097619507061808, 0.6060376328682938, 0.6013012777554784, 0.6025305536021106, 0.6052087022958809], 'communities': [314, 291, 282, 219, 382, 246, 307, 287, 348, 254], 'time': [15.384473085403442, 16.508063793182373, 15.92988896369934, 17.097694158554077, 16.93193292617798, 1

In [14]:
# Convert the results dictionary into a pandas DataFrame
# First, prepare the data in a structured form
data = {
    "Algorithm": [],
    "Run": [],
    "Modularity": [],
    "Communities": [],
    "Time (s)": []
}

# Populate the structured data from the results
for algo in results:
    for run in range(len(results[algo]["modularity"])):  # One entry per recorded run
        data["Algorithm"].append(algo)
        data["Run"].append(run + 1)  # Run number (1-10)
        data["Modularity"].append(results[algo]["modularity"][run])
        data["Communities"].append(results[algo]["communities"][run])
        data["Time (s)"].append(results[algo]["time"][run])

# Creating the DataFrame
results_df = pd.DataFrame(data)

# Display the DataFrame for visual inspection
print(results_df)

# Additionally, creating a summary DataFrame for averages
summary_data = {
    "Algorithm": ["Louvain", "Leiden"],
    "Avg. Modularity": [results["Louvain"]["avg_modularity"], results["Leiden"]["avg_modularity"]],
    "Avg. Communities": [results["Louvain"]["avg_communities"], results["Leiden"]["avg_communities"]],
    "Avg. Time (s)": [results["Louvain"]["avg_time"], results["Leiden"]["avg_time"]]
}

summary_df = pd.DataFrame(summary_data)

# Display the summary DataFrame
print(summary_df)

   Algorithm  Run  Modularity  Communities    Time (s)
0    Louvain    1    0.710791          270  140.054747
1    Louvain    2    0.714603          407  185.661450
2    Louvain    3    0.710078          323  155.410629
3    Louvain    4    0.711287          317  143.723388
4    Louvain    5    0.710413          276  165.808069
5    Louvain    6    0.710721          314  114.403051
6    Louvain    7    0.711267          346  145.007778
7    Louvain    8    0.714045          131  146.197177
8    Louvain    9    0.710947          271  183.729935
9    Louvain   10    0.710265          378  167.231878
10    Leiden    1    0.606196          314   15.384473
11    Leiden    2    0.607785          291   16.508064
12    Leiden    3    0.603918          282   15.929889
13    Leiden    4    0.608530          219   17.097694
14    Leiden    5    0.603135          382   16.931933
15    Leiden    6    0.609762          246   17.445023
16    Leiden    7    0.606038          307   15.634144
17    Leid

In [15]:
# Save the DataFrame to a CSV file
csv_filename = 'M2V_CD_Experiment_Values.csv'
results_df.to_csv(csv_filename, index=False)
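Once the per-run values are saved, the averages (and their spread) can be recomputed from the CSV with a `groupby`, instead of reading them back out of the `results` dictionary. The sketch below uses an in-memory CSV with made-up numbers in the same column layout; the values are placeholders, not experiment results.

```python
import io

import pandas as pd

# Toy stand-in for the saved CSV (hypothetical values, two runs per algorithm)
csv_text = """Algorithm,Run,Modularity,Communities,Time (s)
Louvain,1,0.70,300,150.0
Louvain,2,0.72,310,160.0
Leiden,1,0.60,280,15.0
Leiden,2,0.62,290,17.0
"""

saved_df = pd.read_csv(io.StringIO(csv_text))

# Mean and standard deviation of each metric per algorithm
summary = saved_df.groupby("Algorithm")[
    ["Modularity", "Communities", "Time (s)"]
].agg(["mean", "std"])

print(summary)
```

Reporting the standard deviation alongside the mean makes it easier to judge whether a difference between the two algorithms is larger than the run-to-run variation.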