# Community Detection Tutorial 🕸️

Welcome to this comprehensive tutorial on community detection in graphs! In this notebook, we'll explore different methods to identify communities (clusters) in network graphs.

## Concepts Covered:
- **Community Detection**: Finding groups of nodes that are more densely connected to each other than to the rest of the network
- **Node Embedding**: Converting graph nodes into vector representations that capture structural relationships
- **Clustering Algorithms**: Grouping nodes based on their embeddings or structural properties
- **Dimension Reduction**: Visualizing high-dimensional embeddings in 2D space

Community detection can help identify:
- Groups of users with similar interests or connections
- Potential targets for targeted marketing or content
- Information flow patterns and influence propagation
- Network resilience and critical nodes

## What You'll Learn:
1. Graph-based community detection (Girvan-Newman, Louvain algorithms)
2. Node embedding techniques (Node2Vec)
3. Visualizing communities in 2D using t-SNE
4. Comparing different clustering approaches
5. Evaluating community quality with metrics


In [None]:
import collections
import warnings

import matplotlib.cm as cm
import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from networkx.algorithms.community.quality import modularity
from node2vec import Node2Vec
from openTSNE import TSNE
from plotly.subplots import make_subplots
from sklearn.cluster import DBSCAN, KMeans, OPTICS
from sklearn.metrics import silhouette_score

warnings.filterwarnings('ignore')

## Utility Functions

Let's define some helper functions for formatting and visualizing communities:


In [None]:
def format_comp(comp):
    """
    Convert a list of communities into a partition dictionary.
    
    Args:
        comp: List of communities, where each community is a list of nodes
        
    Returns:
        partition: Dictionary mapping node -> community_id
    """
    partition = {}
    for id_cluster, community in enumerate(comp):
        for user in community:
            partition[user] = id_cluster + 1
    return partition

def plot_graph_with_communities(G, partition, title="Graph with Communities", figsize=(12, 8)):
    """
    Visualize a graph with nodes colored by their community membership.
    Each community gets a distinct color.
    
    Args:
        G: NetworkX graph
        partition: Either:
            - List (or iterable) of sets of nodes, as returned by nx.community.louvain_communities
            - Dictionary mapping node -> community_id
        title: Plot title
        figsize: Figure size tuple
    """
    plt.figure(figsize=figsize)
    
    # Compute layout for nodes
    pos = nx.layout.forceatlas2_layout(G)
    
    # Handle different partition formats
    if isinstance(partition, dict):
        # Partition is already a dictionary: node -> community_id
        node_to_comm_raw = partition
    else:
        # Partition is a list of sets: convert to dictionary
        node_to_comm_raw = {}
        for comm_id, community in enumerate(partition):
            for node in community:
                node_to_comm_raw[node] = comm_id
    
    # Normalize community IDs to be sequential (0, 1, 2, ...)
    # This ensures each community gets a distinct color from the colormap
    unique_comm_ids = sorted(set(node_to_comm_raw.values()))
    comm_id_mapping = {old_id: new_id for new_id, old_id in enumerate(unique_comm_ids)}
    node_to_comm = {node: comm_id_mapping[comm_id] for node, comm_id in node_to_comm_raw.items()}
    
    # Get number of communities (after normalization)
    num_communities = len(unique_comm_ids)
    
    # Choose colormap based on number of communities
    # Use tab20 for up to 20 communities, otherwise use a larger colormap
    if num_communities <= 20:
        # plt.get_cmap replaces plt.cm.get_cmap, which newer Matplotlib versions removed
        cmap = plt.get_cmap('tab20')
    elif num_communities <= 40:
        # Stack tab20 and tab20b to get 40 distinct colors
        colors1 = plt.cm.tab20(np.linspace(0, 1, 20))
        colors2 = plt.cm.tab20b(np.linspace(0, 1, 20))
        from matplotlib.colors import ListedColormap
        cmap = ListedColormap(np.vstack([colors1, colors2]))
    else:
        # Use a continuous colormap for many communities
        cmap = plt.get_cmap('nipy_spectral')
    
    # Assign color to each node based on its community
    node_colors = [node_to_comm.get(node, -1) for node in G.nodes()]
    
    # Draw nodes with proper color scaling
    nx.draw_networkx_nodes(
        G, pos,
        node_size=100,
        node_color=node_colors,
        cmap=cmap,
        vmin=-0.5,  # Slightly below 0 to handle missing nodes
        vmax=num_communities - 0.5,  # Slightly above max to center colors
        alpha=0.85
    )
    
    # Draw edges
    nx.draw_networkx_edges(G, pos, alpha=0.3, width=0.5)
    
    # Optionally draw labels for small graphs
    if len(G.nodes()) < 50:
        nx.draw_networkx_labels(G, pos, font_size=8)
    
    plt.title(title, fontsize=14, fontweight='bold')
    plt.axis('off')
    plt.tight_layout()
    plt.show()

def print_community_stats(G, partition, name="Community"):
    """
    Print statistics about the detected communities.
    
    Args:
        G: NetworkX graph
        partition: List (or iterable) of sets of nodes, as returned by nx.community.louvain_communities
        name: Name prefix for the statistics
    """
    # Build a mapping node -> community_id for convenience
    node_to_comm = {}
    for comm_id, community in enumerate(partition):
        for node in community:
            node_to_comm[node] = comm_id

    print(f"\n{name} Statistics:")
    print(f"  Total nodes: {len(G.nodes())}")
    print(f"  Total edges: {len(G.edges())}")
    print(f"  Number of communities: {len(partition)}")
    print(f"\n  Community sizes:")
    for comm_id, community in enumerate(partition):
        print(f"    {name} {comm_id}: {len(community)} nodes")
        if len(community) <= 10:
            print(f"      Nodes: {', '.join(str(n) for n in list(community)[:10])}")

### Load the Les Misérables character co-occurrence network
###### This network represents characters that appear together in scenes

In [None]:
G2 = nx.read_gml("../../data/lesmiserables.gml")

print(f"Graph loaded: {len(G2.nodes())} nodes, {len(G2.edges())} edges")
print(f"Graph density: {nx.density(G2):.4f}")

### Apply Girvan-Newman algorithm

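Note: `girvan_newman` returns a generator that yields one partition per level of the dendrogram, so we convert it to a list to see all levels. This can be slow on large graphs, because edge betweenness is recomputed after every edge removal.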
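Since the levels are produced lazily, it is possible to stop early instead of exhausting the generator. The sketch below is a minimal illustration that assumes Zachary's karate club graph as a stand-in dataset (`G_demo` is not part of this notebook's data) and takes only the first three levels with `itertools.islice`.

```python
import itertools

import networkx as nx

# Stand-in graph for illustration only (not the tutorial's dataset)
G_demo = nx.karate_club_graph()

# girvan_newman is a generator: each step removes edges until the number
# of connected components grows by one, then yields a tuple of node sets
levels_iter = nx.community.girvan_newman(G_demo)

# Take only the first 3 levels instead of computing the full dendrogram
first_levels = list(itertools.islice(levels_iter, 3))

for depth, communities in enumerate(first_levels, start=1):
    sizes = sorted(len(c) for c in communities)
    print(f"Level {depth}: {len(communities)} communities, sizes {sizes}")
```

On a connected graph, level 1 has two communities, level 2 has three, and so on, so the slice length directly controls how fine the finest partition is.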

In [None]:
comp = list(nx.algorithms.community.centrality.girvan_newman(G2))
print(f"\nNumber of decomposition levels: {len(comp)}")
print(f"Each level shows a different granularity of communities")

In [None]:
# Visualize different levels of the dendrogram
max_cluster = 10
print("Visualizing different levels of community decomposition:\n")

for level_dendrogramme, clusters in enumerate(comp):
    if level_dendrogramme + 1 >= max_cluster:
        break
    
    comp_formatted = tuple(sorted(c) for c in clusters)
    partition = format_comp(comp_formatted)
    
    print(f"Level {level_dendrogramme + 1}: {len(clusters)} communities")
    plot_graph_with_communities(
        G2, 
        partition, 
        title=f"Girvan-Newman Level {level_dendrogramme + 1} - {len(clusters)} Communities"
    )
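Plotting every level leaves open the question of which one to keep. One common heuristic, sketched here under the assumption that modularity is an acceptable quality score, is to pick the level with the highest modularity. The karate club graph and the names `G_demo`, `scored`, and `best_communities` are illustrative and not part of the notebook above.

```python
import itertools

import networkx as nx

# Stand-in graph for illustration only
G_demo = nx.karate_club_graph()

# Limit the search to the first 10 levels of the dendrogram
levels = itertools.islice(nx.community.girvan_newman(G_demo), 10)

# Score each level; weight=None treats the graph as unweighted, matching
# the unweighted edge betweenness that girvan_newman uses by default
scored = [
    (nx.community.modularity(G_demo, communities, weight=None), communities)
    for communities in levels
]

# Keep the level with the highest modularity
best_q, best_communities = max(scored, key=lambda item: item[0])
print(f"Best level has {len(best_communities)} communities (modularity {best_q:.4f})")
```

The same pattern would apply to `comp` from the cell above by replacing the `islice` call with the list of levels already computed.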

## 1.2 Louvain Algorithm

The **Louvain algorithm** is a fast, heuristic method for community detection that optimizes modularity. It's one of the most popular community detection algorithms because it:
- Runs in near-linear time on sparse graphs
- Produces high-quality communities
- Can handle very large networks

### How it Works:
1. Starts with each node in its own community
2. Iteratively moves nodes to neighboring communities if it increases modularity
3. Aggregates communities into super-nodes
4. Repeats until no improvement is possible
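Steps 1 and 2 can be sketched in a few lines. This is a deliberately naive illustration, not the optimized procedure behind `nx.community.louvain_communities`: it recomputes the full modularity (defined in the next subsection) for every candidate move, whereas real implementations use a closed-form modularity gain, and it omits the aggregation of step 3. The helper names `to_communities` and `local_moving_phase` are hypothetical.

```python
import networkx as nx


def to_communities(assignment):
    """Turn a node -> label dict into a list of node sets."""
    groups = {}
    for node, label in assignment.items():
        groups.setdefault(label, set()).add(node)
    return list(groups.values())


def local_moving_phase(G):
    """Steps 1-2 only: greedy node moves scored by brute-force modularity."""
    # Step 1: every node starts in its own community
    assignment = {node: i for i, node in enumerate(G)}
    current_q = nx.community.modularity(G, to_communities(assignment))

    improved = True
    while improved:
        improved = False
        for node in G:
            best_label, best_q = assignment[node], current_q
            # Step 2: try moving the node into each neighbor's community
            for neighbor in G[node]:
                trial = dict(assignment)
                trial[node] = assignment[neighbor]
                q = nx.community.modularity(G, to_communities(trial))
                if q > best_q + 1e-12:  # accept strict improvements only
                    best_label, best_q = assignment[neighbor], q
            if best_label != assignment[node]:
                assignment[node] = best_label
                current_q = best_q
                improved = True
    return assignment, current_q


# Stand-in graph for illustration only
G_demo = nx.karate_club_graph()
singleton_q = nx.community.modularity(G_demo, [{n} for n in G_demo])
assignment, final_q = local_moving_phase(G_demo)

print(f"Modularity with singletons: {singleton_q:.4f}")
print(f"Modularity after local moves: {final_q:.4f}")
print(f"Communities found: {len(to_communities(assignment))}")
```

Because only strictly improving moves are accepted, modularity increases monotonically and the loop is guaranteed to stop.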

### Modularity:
Modularity measures how much more densely connected nodes are within communities than would be expected in a random network with the same degree sequence. Values range from -1/2 to 1, with higher values indicating stronger community structure.
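For an unweighted, undirected graph with m edges, modularity can be written as the sum over communities of L_c / m - (d_c / 2m)^2, where L_c is the number of edges inside community c and d_c is the sum of the degrees of its nodes. The sketch below checks this formula by hand against `nx.community.modularity` on a toy graph; the barbell graph and the variable names are assumptions made for illustration.

```python
import networkx as nx

# Toy graph: two 4-cliques (nodes 0-3 and 4-7) joined by the single edge (3, 4)
G_demo = nx.barbell_graph(4, 0)
communities = [set(range(4)), set(range(4, 8))]

m = G_demo.number_of_edges()  # 6 + 6 + 1 = 13 edges

q_manual = 0.0
for community in communities:
    # L_c: edges with both endpoints inside the community
    internal_edges = G_demo.subgraph(community).number_of_edges()
    # d_c: sum of degrees of the community's nodes
    degree_sum = sum(degree for _, degree in G_demo.degree(community))
    q_manual += internal_edges / m - (degree_sum / (2 * m)) ** 2

q_networkx = nx.community.modularity(G_demo, communities)

print(f"Manual modularity:   {q_manual:.4f}")
print(f"NetworkX modularity: {q_networkx:.4f}")
```

Each clique contributes 6/13 - (13/26)^2, so the total is 12/13 - 1/2, roughly 0.423.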


In [None]:
# Load Game of Thrones character interaction network
# This dataset contains character interactions across all 8 seasons
df = pd.read_csv('../../data/got-s1-8-edges.csv')

# Create graph from edge list
G = nx.from_pandas_edgelist(df,
                            source='Source',
                            target='Target',
                            edge_attr=['Weight', 'Season'])

print(f"Game of Thrones Network:")
print(f"  Nodes (characters): {len(G.nodes())}")
print(f"  Edges (interactions): {len(G.edges())}")
print(f"  Graph density: {nx.density(G):.4f}")
print(f"  Average clustering: {nx.average_clustering(G):.4f}")
print(f"\n  Sample nodes: {list(G.nodes())[:10]}")

### Apply Louvain algorithm to detect communities

In [None]:
partition = nx.community.louvain_communities(G, resolution=1.5, seed=42)
# Construct a mapping node -> community_id
node_to_comm = {
    node: comm_id
    for comm_id, community in enumerate(partition)
    for node in community
}

# Print community statistics
print_community_stats(G, partition, name="Louvain")

### Visualizing Louvain Communities

Let's visualize the Game of Thrones network with nodes colored by their detected communities:


In [None]:
plot_graph_with_communities(
    G, 
    partition, 
    title="Game of Thrones Character Communities (Louvain Algorithm)",
    figsize=(14, 10)
)

#### On Facebook followers...

In [None]:
# Load the followers graph (kept in its own variables so that G and
# node_to_comm still refer to the Game of Thrones network used in Part II)
G_followers = nx.read_graphml("../../data/followers.graphml")
G_followers = G_followers.to_undirected()

# Print the number of nodes and edges
print(f"Number of nodes: {len(G_followers.nodes())}")
print(f"Number of edges: {len(G_followers.edges())}")

# Apply the Louvain algorithm
partition_followers = nx.community.louvain_communities(G_followers, resolution=3, seed=42)

len(partition_followers)

In [None]:
# Display the names of the members of each community with more than 5 nodes
for comm_id, community in enumerate(partition_followers):
    if len(community) > 5:
        print("-----------------------")
        print(f"Community {comm_id}: {', '.join([G_followers.nodes[node]['name'] for node in community])}")

In [None]:
# Visualize the communities
plot_graph_with_communities(
    G_followers, 
    partition_followers, 
    title="Followers Communities (Louvain Algorithm)",
    figsize=(14, 10)
)

---

# Part II: Node Embedding-Based Community Detection

In this section, we'll use **node embeddings** to represent nodes as vectors in a high-dimensional space, then apply traditional clustering algorithms to detect communities.

## 2.1 Node2Vec Embedding

**Node2Vec** is a powerful algorithm that learns continuous feature representations for nodes in networks. It:
- Uses random walks to explore the network structure
- Applies the Skip-gram model (similar to Word2Vec) to learn embeddings
- Captures both local and global network properties
- Produces embeddings that preserve network neighborhoods

### Key Parameters:
- `dimensions`: Size of the embedding vector (typically 64-128)
- `walk_length`: Length of random walks
- `num_walks`: Number of walks per node
- `p`, `q`: Control the random walk behavior (BFS vs DFS exploration)
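To make the role of `p` and `q` concrete, here is a minimal sketch of a single second-order biased walk. It is not the `node2vec` package's implementation: it ignores edge weights, and the function name `biased_walk` and the karate club graph are assumptions for illustration. When choosing the next node, returning to the previous node gets weight 1/p, moving to a node that is also adjacent to the previous node gets weight 1, and moving farther away gets weight 1/q.

```python
import random

import networkx as nx


def biased_walk(G, start, walk_length, p=1.0, q=1.0, rng=None):
    """Generate one node2vec-style walk on an unweighted graph."""
    rng = rng or random.Random(0)
    walk = [start]
    while len(walk) < walk_length:
        current = walk[-1]
        neighbors = list(G[current])
        if not neighbors:
            break  # dead end: isolated node
        if len(walk) == 1:
            # First step has no previous node, so choose uniformly
            walk.append(rng.choice(neighbors))
            continue
        previous = walk[-2]
        weights = []
        for candidate in neighbors:
            if candidate == previous:
                weights.append(1.0 / p)  # return to where we came from
            elif G.has_edge(candidate, previous):
                weights.append(1.0)      # stay close to the previous node
            else:
                weights.append(1.0 / q)  # explore outward
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


# Stand-in graph for illustration only
G_demo = nx.karate_club_graph()
walk = biased_walk(G_demo, start=0, walk_length=10, p=0.5, q=2.0,
                   rng=random.Random(42))
print(walk)
```

With a small `q` the walk tends to move away from its origin (DFS-like), while a large `q` keeps it in the local neighborhood (BFS-like). Setting `p=1, q=1` gives a plain uniform random walk.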

**Reference**: [Node2Vec GitHub](https://github.com/eliorc/node2vec)


In [None]:
%%time

# Create Node2Vec model
# dimensions=256 creates 256-dimensional vectors for each node
# workers=4 uses 4 parallel workers for faster computation
# seed=42 for reproducibility
node2vec = Node2Vec(G, dimensions=256, walk_length=30, num_walks=200, 
                    workers=4, seed=42, p=1, q=1)

print("Node2Vec model initialized!")
print(f"  Graph: {len(G.nodes())} nodes, {len(G.edges())} edges")
print(f"  Embedding dimensions: 256")
print(f"  Walk length: 30")
print(f"  Number of walks per node: 200")

In [None]:
%%time

# Train the Node2Vec model
# This generates random walks and trains the embedding model
model = node2vec.fit()

print("Model trained successfully!")
print(f"  Vocabulary size: {len(model.wv.key_to_index)}")
print(f"  Vector dimensions: {model.wv.vectors.shape}")

In [None]:
# Display the embedding vectors as a DataFrame
embeddings_df = pd.DataFrame(
    model.wv.vectors,
    index=model.wv.index_to_key
)
print(f"Embedding matrix shape: {embeddings_df.shape}")
print(f"\nFirst few rows:")
embeddings_df.head(10)

## 2.2 Exploring Node Similarities

One of the powerful features of node embeddings is that we can find nodes that are "similar" in the network structure, even if they're not directly connected. This is done by computing cosine similarity between embedding vectors.
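As a reminder of what cosine similarity measures, the following sketch computes it by hand with NumPy on three toy vectors. The vectors and the helper `cosine_similarity` are illustrative assumptions and are unrelated to the trained model.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1 = same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])  # same direction as a, twice the length
c = np.array([0.0, 0.0, 3.0])  # orthogonal to a

sim_ab = cosine_similarity(a, b)
sim_ac = cosine_similarity(a, c)

print(f"similarity(a, b) = {sim_ab:.4f}")
print(f"similarity(a, c) = {sim_ac:.4f}")
```

Only the direction of the vectors matters, not their length, which is why `a` and `b` are maximally similar even though their norms differ.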


In [None]:
# Find characters most similar to DROGO based on network structure
target_character = 'DROGO'
similar_nodes = model.wv.most_similar(target_character, topn=10)

print(f"Characters most similar to {target_character} (based on network embedding):\n")
for i, (character, similarity) in enumerate(similar_nodes, 1):
    print(f"{i:2d}. {character:15s} (similarity: {similarity:.4f})")

In [None]:
# Bonus: Check if these characters are in the same community
print(f"\nCommunity analysis:")
drogo_community = node_to_comm.get(target_character, -1)
print(f"{target_character} belongs to community {drogo_community}")
print(f"\nSimilar characters' communities:")
for character, similarity in similar_nodes[:10]:
    char_community = node_to_comm.get(character, -1)
    same_comm = "✓" if char_community == drogo_community else "✗"
    print(f"  {character:15s} -> Community {char_community} {same_comm}")

## 2.3 Dimension Reduction with t-SNE

Node2Vec produces 256-dimensional vectors, which we need to reduce to 2D for visualization. **t-SNE** (t-distributed Stochastic Neighbor Embedding) is well suited to this:
- Preserves local neighborhoods (similar nodes stay close)
- Often reveals clusters and community structure
- Does not preserve global geometry, so distances between well-separated clusters in the 2D plot should be read qualitatively

**Reference**: [openTSNE documentation](https://github.com/pavlin-policar/openTSNE)


In [None]:
%%time

# Apply t-SNE to reduce 256D embeddings to 2D
tsne = TSNE(n_components=2, random_state=42)
X_embedded = tsne.fit(model.wv.vectors)

print(f"Dimension reduction complete!")
print(f"  Original dimensions: {model.wv.vectors.shape} ({model.wv.vectors.shape[1]}D)")
print(f"  Reduced dimensions: {X_embedded.shape} (2D)")
print(f"  Reduction ratio: {model.wv.vectors.shape[1] / X_embedded.shape[1]}:1")

## 2.4 Visualizing Embedded Nodes

Now let's create an interactive visualization of the 2D embeddings, colored by the communities we found using the Louvain algorithm. This will help us see if the embedding-based representation aligns with the graph-based communities.


In [None]:
# Create DataFrame with embeddings and community information
df_emb = pd.DataFrame(X_embedded, columns=['x', 'y'])
df_emb['name'] = model.wv.index_to_key
df_emb['louvain_community'] = df_emb['name'].apply(lambda name: node_to_comm.get(name, -1))
df_emb['community_label'] = df_emb['louvain_community'].apply(
    lambda x: f'Community {x}' if x != -1 else 'Unknown'
)
# Display the dataframe
print("Embedding DataFrame with community labels:")
df_emb.head(10)

In [None]:
# Create interactive scatter plot with Plotly
fig = px.scatter(
    df_emb, 
    x='x', 
    y='y', 
    hover_name='name',
    color='community_label',
    color_discrete_sequence=px.colors.qualitative.Set3,
    title='Node Embeddings in 2D Space (colored by Louvain communities)',
    labels={'x': 't-SNE Dimension 1', 'y': 't-SNE Dimension 2'},
    height=800,
    width=1200
)

# Improve the layout
fig.update_traces(marker=dict(size=8, opacity=0.7, line=dict(width=0.5, color='DarkSlateGrey')))
fig.update_layout(
    title_font_size=16,
    hovermode='closest',
    showlegend=True
)

fig.show()

### 🎯 Insights from the Visualization

- **Clustered nodes**: Nodes that are close in 2D space are similar in network structure
- **Community separation**: If Louvain communities form distinct clusters in the embedding space, it suggests the communities are well-defined
- **Outliers**: Nodes far from their community clusters might be bridges or have ambiguous community membership


## 2.5 Clustering on Embeddings

Now let's apply traditional clustering algorithms directly on the node embeddings to detect communities. We'll compare three different algorithms:

1. **K-Means**: Partition-based clustering, requires specifying the number of clusters
2. **DBSCAN**: Density-based clustering, finds clusters of varying shapes
3. **OPTICS**: Ordering points to identify clustering structure, similar to DBSCAN but handles varying densities
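DBSCAN's `eps` radius usually has to be tuned to the data. One common heuristic, sketched below on synthetic 2D points (an assumption for illustration, not the notebook's embeddings), is to sort each point's distance to its k-th nearest neighbor and look for the "elbow" where the distances start to rise sharply.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Synthetic data for illustration: two compact 2D blobs
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
])

min_samples = 3

# Querying with the training data returns each point as its own first
# neighbor, which matches DBSCAN's convention that min_samples counts
# the point itself
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_demo)
distances, _ = nn.kneighbors(X_demo)

# Sorted distance to the farthest of the min_samples neighbors
k_distances = np.sort(distances[:, -1])

for pct in (50, 90, 99):
    print(f"{pct}th percentile k-distance: {np.percentile(k_distances, pct):.3f}")
```

A value of `eps` near the elbow of the sorted curve (plotting `k_distances` makes it visible) is a reasonable starting point. It remains a heuristic, so the resulting clusters still need to be inspected.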

### Choosing the Right Embedding Space:
- **256D embeddings**: More accurate, captures full structural information
- **2D embeddings**: Less accurate but visualization matches the clustering

In [None]:
# Choose embedding space for clustering
# Option 1: Use 2D embeddings (matches visualization but less precise)
X = df_emb[['x', 'y']].values

# Option 2: Use 256D embeddings (more accurate but visualization won't match)
#X = model.wv.vectors

print(f"Using {X.shape[1]}D embeddings for clustering")
print(f"Shape: {X.shape}")

# Define clustering algorithms
clusterings = [
    ('K-Means', KMeans(n_clusters=len(set(node_to_comm.values())), random_state=42, n_init=10)),
    ('DBSCAN', DBSCAN(min_samples=3, eps=2.0)),
    ('OPTICS', OPTICS(min_samples=3, metric='euclidean'))
]

# Store results
clustering_results = {}

for name, clustering_alg in clusterings:
    print(f"\n{'='*60}")
    print(f"Applying {name}...")
    print(f"{'='*60}")
    
    # Fit the clustering algorithm
    labels = clustering_alg.fit_predict(X)
    
    # Calculate number of clusters (excluding noise points labeled as -1)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = list(labels).count(-1)
    
    print(f"  Number of clusters: {n_clusters}")
    if n_noise > 0:
        print(f"  Noise points: {n_noise}")
    
    # Calculate silhouette score on non-noise points (needs at least 2 clusters)
    mask = labels != -1
    if n_clusters > 1:
        try:
            silhouette = silhouette_score(X[mask], labels[mask])
            print(f"  Silhouette score (noise excluded): {silhouette:.4f}")
        except Exception as exc:
            print(f"  Silhouette score: cannot compute ({exc})")
    
    # Store results
    clustering_results[name] = labels
    
    # Create visualization
    df_emb[f'{name}_cluster'] = labels
    df_emb[f'{name}_label'] = df_emb[f'{name}_cluster'].apply(
        lambda x: f'Cluster {x}' if x != -1 else 'Noise'
    )
    
    # Create plot
    fig = px.scatter(
        df_emb, 
        x='x', 
        y='y', 
        hover_name='name',
        color=f'{name}_label',
        title=f'{name} Clustering Results (on {X.shape[1]}D embeddings)',
        labels={'x': 't-SNE Dimension 1', 'y': 't-SNE Dimension 2'},
        height=800,
        width=1200
    )
    
    fig.update_traces(marker=dict(size=8, opacity=0.7))
    fig.update_layout(title_font_size=16, hovermode='closest')
    fig.show()
    
    # Print cluster sizes
    cluster_sizes = pd.Series(labels).value_counts().sort_index()
    print(f"\n  Cluster sizes:")
    for cluster_id, size in cluster_sizes.items():
        if cluster_id == -1:
            print(f"    Noise: {size} nodes")
        else:
            print(f"    Cluster {cluster_id}: {size} nodes")
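To compare two clusterings numerically rather than by eye, label-agreement scores such as the adjusted Rand index (ARI) and normalized mutual information (NMI) are a common choice. The sketch below uses small hand-made label lists (assumptions for illustration) to show that both scores ignore how cluster IDs are numbered.

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Toy labelings of the same 8 nodes (illustrative, not real results)
louvain_labels = [0, 0, 0, 1, 1, 1, 2, 2]
kmeans_labels = [2, 2, 2, 0, 0, 0, 1, 1]    # same grouping, different IDs
dbscan_labels = [0, 0, 1, 1, 1, 1, -1, -1]  # different grouping, -1 = noise

ari_same = adjusted_rand_score(louvain_labels, kmeans_labels)
ari_diff = adjusted_rand_score(louvain_labels, dbscan_labels)
nmi_diff = normalized_mutual_info_score(louvain_labels, dbscan_labels)

print(f"ARI (Louvain vs K-Means): {ari_same:.4f}")
print(f"ARI (Louvain vs DBSCAN):  {ari_diff:.4f}")
print(f"NMI (Louvain vs DBSCAN):  {nmi_diff:.4f}")
```

In this notebook, the same calls could be applied to `df_emb['louvain_community']` and each entry of `clustering_results`, since both are aligned with the rows of `df_emb`. Note that these scores treat the noise label -1 as just another cluster.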


### Graph-Based Methods (Louvain, Girvan-Newman):
- ✅ **Pros**: 
  - Work directly on graph structure
  - Louvain directly optimizes modularity, a graph-specific quality measure
  - Fast and scalable (Louvain)
  - No need for embeddings
  
- ❌ **Cons**:
  - Limited to graph structure only
  - May miss latent similarities
  - Hard to incorporate node features

### Embedding-Based Methods (Node2Vec + Clustering):
- ✅ **Pros**:
  - Can capture complex relationships
  - Flexible (works with any clustering algorithm)
  - Can be combined with node features (for example, by concatenating them to the embeddings)
  - Enables similarity search
  
- ❌ **Cons**:
  - Requires additional step (embedding)
  - More parameters to tune
  - Clustering quality depends on embedding quality
  - May not optimize graph-specific metrics (modularity)

### When to Use What?
- **Louvain**: Default choice for most community detection tasks
- **Node2Vec + Clustering**: When you need node embeddings for other tasks, or want to incorporate additional features
- **Girvan-Newman**: When you need hierarchical community structure
- **DBSCAN/OPTICS**: When you expect communities of varying densities or want to identify outliers
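Since embedding-based clusters are not chosen to maximize modularity, it can be informative to score them with the same graph metric used for Louvain. The sketch below converts cluster labels into a partition and computes its modularity. The helper `labels_to_communities`, the toy barbell graph, and the choice to treat each noise point as its own singleton community are all assumptions made for illustration.

```python
import networkx as nx


def labels_to_communities(nodes, labels):
    """Group nodes by cluster label; each noise point (-1) becomes a singleton."""
    groups = {}
    for node, label in zip(nodes, labels):
        key = ("noise", node) if label == -1 else ("cluster", label)
        groups.setdefault(key, set()).add(node)
    return list(groups.values())


# Toy graph: two 4-cliques (nodes 0-3 and 4-7) joined by one edge
G_demo = nx.barbell_graph(4, 0)
nodes = list(G_demo.nodes())

# Hypothetical clustering output: node 7 was labeled as noise
labels = [0, 0, 0, 0, 1, 1, 1, -1]

communities = labels_to_communities(nodes, labels)
q_clusters = nx.community.modularity(G_demo, communities)

# Reference: the partition that follows the two cliques exactly
q_ideal = nx.community.modularity(G_demo, [set(range(4)), set(range(4, 8))])

print(f"Modularity of clustering labels: {q_clusters:.4f}")
print(f"Modularity of the clique split:  {q_ideal:.4f}")
```

In this notebook, `nodes` would correspond to `model.wv.index_to_key` and `labels` to an entry of `clustering_results`, which would allow a like-for-like comparison with the modularity of the Louvain partition.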
