# Community Detection Analysis: Friends Social Network 🕸️

This notebook analyzes the community structure of a real social network (Twitter friends graph) using the **Louvain algorithm** for community detection.

## Overview

We'll explore:
- **Network Structure**: Understanding the topology of the friends network
- **Community Detection**: Applying the Louvain algorithm to identify communities
- **Community Analysis**: Examining the characteristics of detected communities
- **Visualizations**: Creating insightful visualizations of the network and communities
- **Insights**: Drawing meaningful conclusions about the network structure

## Dataset

The `friends.graphml` file contains a real social network graph with Twitter friendship connections, including various node attributes such as:
- User profiles (screen names, descriptions)
- Social metrics (followers, friends, statuses count)
- Profile information (location, verification status)


In [None]:
import warnings
import os
import sys

import matplotlib.pyplot as plt
import matplotlib.cm as cm
import networkx as nx
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from networkx.algorithms.community.quality import modularity

warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

print("✅ All libraries imported successfully!")
print(f"NetworkX version: {nx.__version__}")

## Part I: Loading and Exploring the Network

First, let's load the friends network graph and examine its basic properties.


In [None]:
# Load the graph from GraphML file
# The file is located in the centrality/data directory
graph_path = '../centrality/data/friends.graphml'

print(f"Loading graph from: {graph_path}")
G = nx.read_graphml(graph_path)

# Convert to an undirected graph. Twitter follows are directed, so this treats any follow
# (one-way or mutual) as a single undirected tie and discards the direction.
G = G.to_undirected()

# Remove self-loops if any
G.remove_edges_from(nx.selfloop_edges(G))

print(f"\n✅ Graph loaded successfully!")
print(f"   Nodes: {G.number_of_nodes():,}")
print(f"   Edges: {G.number_of_edges():,}")
print(f"   Is directed: {G.is_directed()}")
print(f"   Is connected: {nx.is_connected(G)}")
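
The effect of `to_undirected()` can be checked on a toy directed graph. The sketch below uses hypothetical account names rather than the friends data; it suggests that a one-way follow and a mutual follow both collapse into a single undirected edge, and that the self-loop cleanup removes only the loop.

```python
import networkx as nx

# Hypothetical toy data: alice and bob follow each other, carol follows alice one-way
D = nx.DiGraph()
D.add_edges_from([("alice", "bob"), ("bob", "alice"), ("carol", "alice")])
D.add_edge("bob", "bob")  # a self-loop, to illustrate the cleanup step

U = D.to_undirected()
U.remove_edges_from(list(nx.selfloop_edges(U)))

# Normalize each edge to a sorted tuple so the listing is order-independent
edges = sorted(tuple(sorted(e)) for e in U.edges())
print(edges)
```

If follow direction matters for a later analysis, the directed graph would need to be kept alongside the undirected one.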


### Network Basic Statistics

Let's compute some fundamental network metrics to understand the structure better.


In [None]:
# Compute basic network statistics
density = nx.density(G)
avg_clustering = nx.average_clustering(G)
avg_degree = sum(dict(G.degree()).values()) / G.number_of_nodes()

# Check connectivity
if not nx.is_connected(G):
    components = list(nx.connected_components(G))
    largest_component = max(components, key=len)
    largest_component_size = len(largest_component)
    print(f"⚠️  Graph is not fully connected")
    print(f"   Number of connected components: {len(components)}")
    print(f"   Largest component size: {largest_component_size:,} nodes ({100*largest_component_size/G.number_of_nodes():.2f}%)")
else:
    print(f"✅ Graph is fully connected")

print(f"\n📊 Network Statistics:")
print(f"   Density: {density:.6f}")
print(f"   Average clustering coefficient: {avg_clustering:.4f}")
print(f"   Average degree: {avg_degree:.2f}")

# Degree distribution
degrees = [d for n, d in G.degree()]
print(f"\n📈 Degree Distribution:")
print(f"   Min degree: {min(degrees)}")
print(f"   Max degree: {max(degrees)}")
print(f"   Median degree: {np.median(degrees):.2f}")
print(f"   Standard deviation: {np.std(degrees):.2f}")
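
For an undirected simple graph with n nodes and m edges, density is 2m / (n(n-1)) and average degree is 2m / n. A minimal sketch on a toy graph (not the friends network) that checks both identities against NetworkX:

```python
import networkx as nx

# Toy graph: a 4-node path plus one chord, so n = 4 and m = 4
T = nx.Graph([(0, 1), (1, 2), (2, 3), (0, 2)])
n, m = T.number_of_nodes(), T.number_of_edges()

density_by_hand = 2 * m / (n * (n - 1))
avg_degree_by_hand = 2 * m / n

# Compare the hand-computed values with what NetworkX reports
print(density_by_hand, nx.density(T))
print(avg_degree_by_hand, sum(d for _, d in T.degree()) / n)
```

Because density divides by n(n-1), large sparse social graphs usually show very small density values even when the average degree is sizeable.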

### Exploring Node Attributes

Let's see what information is available about the nodes in the network.


In [None]:
# Examine node attributes
if G.nodes():
    sample_node = list(G.nodes())[0]
    sample_attrs = G.nodes[sample_node]
    
    print(f"📋 Sample node attributes:")
    print(f"   Node ID: {sample_node}")
    print(f"   Available attributes: {len(sample_attrs)}")
    
    # Show some interesting attributes
    interesting_attrs = ['label', 'name', 'screen_name', 'followers_count', 
                        'friends_count', 'verified', 'location', 'description']
    
    print(f"\n   Key attributes found:")
    for attr in interesting_attrs:
        if attr in sample_attrs:
            value = sample_attrs[attr]
            if isinstance(value, str) and len(str(value)) > 50:
                value = str(value)[:50] + "..."
            print(f"     - {attr}: {value}")
    
    # Create a helper function to extract display names
    def get_display_name(node_id, node_attrs):
        """Extract display name from node attributes."""
        name_attrs = ['label', 'Label', 'name', 'Name', 'screen_name', 'screenName']
        for attr in name_attrs:
            if attr in node_attrs and node_attrs[attr]:
                name = str(node_attrs[attr]).strip()
                if name and name != node_id:
                    return name
        return str(node_id)
    
    # Build name mapping
    node_to_name = {}
    for node_id in G.nodes():
        node_attrs = G.nodes[node_id]
        node_to_name[node_id] = get_display_name(node_id, node_attrs)
    
    print(f"\n✅ Name mapping created for {len(node_to_name):,} nodes")

### Applying the Louvain Algorithm for Community Detection

In [None]:
# seed=42 for reproducibility
partition = nx.community.louvain_communities(G, resolution=2, seed=42, weight=None)

# Create a mapping from node to community ID
node_to_community = {}
for comm_id, community in enumerate(partition):
    for node in community:
        node_to_community[node] = comm_id

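# Calculate modularity for the detected partition.
# Note: this uses the standard definition (resolution=1), while Louvain above ran with
# resolution=2; weight=None keeps the score unweighted, matching the Louvain call.
modularity_score = modularity(G, partition, weight=None)

print(f"✅ Community detection complete!")
print(f"\n📊 Results:")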
print(f"   Number of communities detected: {len(partition):,}")
print(f"   Modularity score: {modularity_score:.4f}")
print(f"   Average community size: {G.number_of_nodes() / len(partition):.2f} nodes")
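
To see what the Louvain call returns, it can help to run it on a graph whose communities are obvious. The sketch below is an illustration on a toy graph, not the friends network: two 5-node cliques joined by one bridge edge. The names `H`, `toy_partition`, and `toy_node_to_comm` are introduced here for the example only.

```python
import networkx as nx
from networkx.algorithms.community.quality import modularity

# Toy graph: two 5-node cliques joined by a single bridge edge (4-5)
H = nx.Graph()
H.add_edges_from(nx.complete_graph(range(0, 5)).edges())
H.add_edges_from(nx.complete_graph(range(5, 10)).edges())
H.add_edge(4, 5)

# louvain_communities returns a list of node sets, one set per community
toy_partition = nx.community.louvain_communities(H, resolution=1, seed=42)
toy_node_to_comm = {node: cid for cid, comm in enumerate(toy_partition) for node in comm}

toy_q = modularity(H, toy_partition)
print(len(toy_partition), "communities")
print(f"modularity: {toy_q:.4f}")
```

Each clique should come out as its own community. Raising `resolution` above 1 tends to favor smaller communities, which is worth keeping in mind when reading the community counts reported for the friends network.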

### Community Size Distribution

Let's examine the distribution of community sizes to understand how the network is structured.


In [None]:
# Analyze community sizes
community_sizes = [len(comm) for comm in partition]
community_sizes_sorted = sorted(community_sizes, reverse=True)

print(f"📊 Community Size Statistics:")
print(f"   Largest community: {max(community_sizes):,} nodes")
print(f"   Smallest community: {min(community_sizes):,} nodes")
print(f"   Median community size: {np.median(community_sizes):.2f} nodes")
print(f"   Mean community size: {np.mean(community_sizes):.2f} nodes")

# Show top 10 largest communities
print(f"\n🏆 Top 10 Largest Communities:")
for i, size in enumerate(community_sizes_sorted[:10], 1):
    percentage = 100 * size / G.number_of_nodes()
    print(f"   {i:2d}. {size:6,} nodes ({percentage:5.2f}% of network)")

# Create a DataFrame for easier analysis
community_df = pd.DataFrame({
    'community_id': range(len(partition)),
    'size': community_sizes
})

# Visualize community size distribution
fig = px.histogram(
    community_df, 
    x='size',
    nbins=50,
    title='Distribution of Community Sizes',
    labels={'size': 'Community Size (number of nodes)', 'count': 'Number of Communities'},
    color_discrete_sequence=['#2E86AB']
)

fig.update_layout(
    showlegend=False,
    height=500,
    width=900
)

fig.show()

### Community Characteristics Analysis

Let's analyze the properties of different communities, such as their internal connectivity and relationship to network metrics.


In [None]:
# Analyze community characteristics
community_stats = []

for comm_id, community in enumerate(partition):
    # Create subgraph for this community
    subgraph = G.subgraph(community)
    
    # Compute metrics
    num_nodes = len(community)
    num_edges = subgraph.number_of_edges()
    
    # Internal density (edges within community / possible edges)
    possible_edges = num_nodes * (num_nodes - 1) / 2
    internal_density = num_edges / possible_edges if possible_edges > 0 else 0
    
    # Average degree within community
    avg_degree = 2 * num_edges / num_nodes if num_nodes > 0 else 0
    
    # Count edges connecting to other communities
    edges_to_other = 0
    for node in community:
        neighbors = list(G.neighbors(node))
        edges_to_other += sum(1 for n in neighbors if n not in community)
    
    # Conductance-style ratio: edges leaving the community / sum of member degrees
    # (sum of member degrees = 2 * internal edges + outgoing edges)
    total_edges_from_comm = num_edges * 2 + edges_to_other
    conductance = edges_to_other / total_edges_from_comm if total_edges_from_comm > 0 else 0
    
    community_stats.append({
        'community_id': comm_id,
        'size': num_nodes,
        'internal_edges': num_edges,
        'internal_density': internal_density,
        'avg_degree': avg_degree,
        'edges_to_other': edges_to_other,
        'conductance': conductance
    })

# Create DataFrame
comm_stats_df = pd.DataFrame(community_stats)

# Sort by size
comm_stats_df = comm_stats_df.sort_values('size', ascending=False)

print("📊 Community Characteristics (Top 10 Largest Communities):")
print(comm_stats_df.head(10).to_string(index=False))

# Summary statistics
print(f"\n📈 Summary Statistics Across All Communities:")
print(f"   Average internal density: {comm_stats_df['internal_density'].mean():.4f}")
print(f"   Average conductance: {comm_stats_df['conductance'].mean():.4f}")
print(f"   Average internal degree: {comm_stats_df['avg_degree'].mean():.2f}")
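
The conductance ratio computed in the cell above divides the boundary edges by the community's own degree sum. NetworkX's `nx.conductance` divides instead by the smaller of the two volumes (community and complement), so the two can differ for communities that hold most of the graph. A small sketch on a toy graph, with a hypothetical helper `cut_over_volume` that mirrors the notebook's ratio:

```python
import networkx as nx

# Toy graph: a 4-clique and a 6-clique joined by one bridge edge (3-4)
C = nx.Graph()
C.add_edges_from(nx.complete_graph(range(0, 4)).edges())
C.add_edges_from(nx.complete_graph(range(4, 10)).edges())
C.add_edge(3, 4)

def cut_over_volume(graph, community):
    """Boundary edges divided by the sum of member degrees (hypothetical helper)."""
    community = set(community)
    cut = nx.cut_size(graph, community)
    volume = nx.volume(graph, community)
    return cut / volume if volume > 0 else 0.0

small, large = set(range(0, 4)), set(range(4, 10))
for name, comm in [("small", small), ("large", large)]:
    # First value: notebook-style ratio; second value: NetworkX conductance
    print(name, cut_over_volume(C, comm), nx.conductance(C, comm))
```

For the small clique both values agree; for the large clique the notebook-style ratio is lower because its own volume is the larger of the two.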


### Visualization 1: Community Size vs Internal Density

This scatter plot shows the relationship between community size and how densely connected nodes are within each community.


In [None]:
# Create scatter plot: Community size vs Internal density
fig = px.scatter(
    comm_stats_df,
    x='size',
    y='internal_density',
    size='internal_edges',
    color='conductance',
    hover_data=['community_id', 'avg_degree'],
    title='Community Size vs Internal Density',
    labels={
        'size': 'Community Size (number of nodes)',
        'internal_density': 'Internal Density',
        'conductance': 'Conductance (lower is better)',
        'internal_edges': 'Internal Edges'
    },
    color_continuous_scale='Viridis',
    height=600,
    width=1000
)

fig.update_traces(
    marker=dict(opacity=0.7, line=dict(width=0.5, color='DarkSlateGrey'))
)

fig.update_layout(
    title_font_size=16,
    hovermode='closest'
)

fig.show()

print("💡 Insights:")
print("   - Larger communities tend to have lower internal density (sparser connections)")
print("   - High internal density indicates tightly-knit groups")
print("   - Low conductance means communities are well-separated from the rest of the network")
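
The first insight (larger communities tend to be sparser) is printed as text rather than computed. One way to check it is a rank correlation between size and internal density. The sketch below uses hypothetical values standing in for `comm_stats_df`; the same two lines could be applied to the real DataFrame.

```python
import pandas as pd

# Hypothetical values standing in for comm_stats_df; replace with the real DataFrame
toy_stats = pd.DataFrame({
    "size": [5, 10, 20, 40],
    "internal_density": [0.90, 0.50, 0.20, 0.05],
})

# Spearman correlation computed as the Pearson correlation of the ranks
rho = toy_stats["size"].rank().corr(toy_stats["internal_density"].rank())
print(f"Spearman rho: {rho:.2f}")
```

A value close to -1 would support the claim; a value near 0 would suggest the relationship is weak for this partition.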


### Visualization 2: Network Layout with Communities

For smaller networks or subgraphs, we can visualize the network structure with nodes colored by their community membership. For large networks, we'll work with a sample or the largest component.


In [None]:
# For visualization, we'll work with a sample if the network is too large
# or focus on the largest component
MAX_NODES_FOR_VISUALIZATION = 500

if G.number_of_nodes() > MAX_NODES_FOR_VISUALIZATION:
    print(f"⚠️  Network is large ({G.number_of_nodes():,} nodes)")
    print(f"   Creating visualization with a strategic sample...\n")
    
    # Strategy: Sample nodes from largest communities to preserve structure
    # Get top communities by size
    top_communities = sorted(partition, key=len, reverse=True)[:10]
    
    # Sample nodes from each top community
    sample_nodes = set()
    nodes_per_comm = MAX_NODES_FOR_VISUALIZATION // len(top_communities)
    
    for comm in top_communities:
        if len(comm) <= nodes_per_comm:
            sample_nodes.update(comm)
        else:
            # Sort first so the seeded draw does not depend on set iteration order
            sample_nodes.update(np.random.choice(sorted(comm), nodes_per_comm, replace=False))
    
    # Also include neighbors to preserve some structure
    extended_sample = set(sample_nodes)
    for node in sorted(sample_nodes)[:100]:  # Limit to avoid explosion
        neighbors = list(G.neighbors(node))
        extended_sample.update(np.random.choice(neighbors, min(5, len(neighbors)), replace=False))
    
    # Create subgraph
    G_viz = G.subgraph(extended_sample).copy()
    
    # Restrict the existing partition to the sampled nodes (communities are not recomputed)
    partition_viz = []
    for comm in partition:
        comm_filtered = [n for n in comm if n in G_viz.nodes()]
        if len(comm_filtered) > 0:
            partition_viz.append(comm_filtered)
    
    print(f"   Sample size: {G_viz.number_of_nodes():,} nodes")
    print(f"   Communities in sample: {len(partition_viz)}")
else:
    G_viz = G
    partition_viz = partition
    print(f"✅ Using full network for visualization ({G_viz.number_of_nodes():,} nodes)")

# Create node-to-community mapping for visualization
node_to_comm_viz = {}
for comm_id, community in enumerate(partition_viz):
    for node in community:
        node_to_comm_viz[node] = comm_id

# Compute layout
print(f"\n📐 Computing network layout (this may take a moment)...")
pos = nx.spring_layout(G_viz, k=1, iterations=50, seed=42)

print(f"✅ Layout computed!")
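
The per-community sampling step can be isolated into a small function. The sketch below is one possible variant, not the notebook's exact code: `sample_per_community` is a hypothetical helper that uses a local `np.random.default_rng` generator and sorts members before drawing, assuming node IDs are sortable (GraphML IDs are normally strings).

```python
import numpy as np

def sample_per_community(communities, max_nodes, top_k=10, seed=42):
    """Draw up to max_nodes // top_k nodes from each of the top_k largest communities."""
    rng = np.random.default_rng(seed)
    top = sorted(communities, key=len, reverse=True)[:top_k]
    quota = max_nodes // max(len(top), 1)
    sampled = set()
    for comm in top:
        members = sorted(comm)  # fixed order, so the draw does not depend on set ordering
        if len(members) <= quota:
            sampled.update(members)
        else:
            picked = rng.choice(len(members), size=quota, replace=False)
            sampled.update(members[i] for i in picked)
    return sampled

# Hypothetical communities of string node IDs
toy_comms = [{f"a{i}" for i in range(50)}, {f"b{i}" for i in range(8)}]
toy_sample = sample_per_community(toy_comms, max_nodes=20)
print(len(toy_sample))
```

Small communities are kept whole and large ones are capped at the quota, so the sample keeps every top community visible in the layout.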


In [None]:
# Create matplotlib visualization
plt.figure(figsize=(16, 12))

# Get community colors
num_communities = len(partition_viz)
# plt.get_cmap is used because matplotlib.cm.get_cmap was removed in recent Matplotlib releases
cmap = plt.get_cmap('tab20', num_communities)

# Assign colors to nodes
node_colors = [node_to_comm_viz.get(node, 0) for node in G_viz.nodes()]

# Draw edges (lighter, thinner)
nx.draw_networkx_edges(
    G_viz, pos,
    alpha=0.1,
    width=0.3,
    edge_color='gray'
)

# Draw nodes (colored by community)
nx.draw_networkx_nodes(
    G_viz, pos,
    node_size=30,
    node_color=node_colors,
    cmap=cmap,
    alpha=0.8,
    linewidths=0.5,
    edgecolors='white'
)

# Optionally add labels for very small graphs
if G_viz.number_of_nodes() < 100:
    # Use display names if available
    labels = {node: node_to_name.get(node, str(node)[:10]) for node in G_viz.nodes()}
    nx.draw_networkx_labels(G_viz, pos, labels, font_size=6, alpha=0.7)

plt.title(f'Network Communities (Louvain Algorithm)\n{G_viz.number_of_nodes():,} nodes, {len(partition_viz)} communities', 
          fontsize=16, fontweight='bold', pad=20)
plt.axis('off')
plt.tight_layout()
plt.show()

print(f"\n✅ Network visualization complete!")
print(f"   Each color represents a different community")
print(f"   Nodes in the same community are more densely connected to each other")

### Visualization 3: Interactive Community Comparison

Let's create an interactive visualization comparing community sizes and properties.


In [None]:
# Create an interactive bar chart of top communities
top_n = min(20, len(comm_stats_df))

fig = go.Figure()

fig.add_trace(go.Bar(
    x=comm_stats_df.head(top_n)['community_id'],
    y=comm_stats_df.head(top_n)['size'],
    marker=dict(
        color=comm_stats_df.head(top_n)['internal_density'],
        colorscale='Viridis',
        showscale=True,
        colorbar=dict(title="Internal<br>Density")
    ),
    text=comm_stats_df.head(top_n)['size'],
    textposition='outside',
    hovertemplate='<b>Community %{x}</b><br>' +
                  'Size: %{y:,} nodes<br>' +
                  'Internal Density: %{marker.color:.3f}<br>' +
                  '<extra></extra>'
))

fig.update_layout(
    title=f'Top {top_n} Largest Communities',
    xaxis_title='Community ID',
    yaxis_title='Number of Nodes',
    height=600,
    width=1200,
    showlegend=False
)

fig.show()


## Part IV: Detailed Community Analysis

Let's dive deeper into specific communities to understand their characteristics.


In [None]:
# Analyze top communities in detail
print("="*70)
print("DETAILED ANALYSIS OF TOP 10 COMMUNITIES")
print("="*70)

for i, (comm_id, community) in enumerate(sorted(enumerate(partition), key=lambda x: len(x[1]), reverse=True)[:10], 1):
    subgraph = G.subgraph(community)
    
    print(f"\n🏆 Community {comm_id} (Rank #{i})")
    print(f"   Size: {len(community):,} nodes")
    print(f"   Internal edges: {subgraph.number_of_edges():,}")
    print(f"   Internal density: {comm_stats_df[comm_stats_df['community_id'] == comm_id]['internal_density'].values[0]:.4f}")
    print(f"   Average degree: {comm_stats_df[comm_stats_df['community_id'] == comm_id]['avg_degree'].values[0]:.2f}")
    
    # Find nodes with highest degree in this community
    degrees_in_comm = {node: G.degree(node) for node in community}
    top_nodes = sorted(degrees_in_comm.items(), key=lambda x: x[1], reverse=True)[:5]
    
    print(f"   Top nodes by degree:")
    for node, degree in top_nodes:
        name = node_to_name.get(node, str(node))
        if len(name) > 30:
            name = name[:30] + "..."
        print(f"     - {name}: {degree} connections")
    
    # Check if we have additional attributes
    if 'followers_count' in G.nodes[list(community)[0]]:
        followers = [G.nodes[node].get('followers_count', 0) for node in community 
                    if G.nodes[node].get('followers_count') is not None]
        if followers:
            print(f"   Average followers (if available): {np.mean(followers):.0f}")


#### Describing Clusters with an LLM

| Rank | Community | Size | Avg Degree | Density | Avg Followers | Top Influencers (Top 3) | People Insight Summary |
|------|------------|------|-------------|----------|----------------|--------------------------|------------------------|
| 🥇 #1 | 22 | 453 | 37.65 | 0.0833 | 82,152 | Axelle Lemaire, Maitre Eolas, Nicolas Loubet | French socio-political and digital culture community. Mix of journalists, politicians, and humor accounts like *Le Gorafi*. Likely focused on French politics, digital rights, and societal commentary. |
| 🥈 #2 | 11 | 337 | 36.52 | 0.1087 | 57,409 | Freakonometrics, Evpok, Boulet | Academic and science communication network, blending data scientists, statisticians, and science illustrators. Strong rationalist and educational tone with humor and analysis. |
| 🥉 #3 | 25 | 318 | 32.23 | 0.1017 | 507,821 | Elon Musk, Edward Snowden, Tim Cook | Global tech and innovation cluster centered around tech leaders, privacy advocates, and major companies. Focus on AI, privacy, innovation, and digital transformation. |
| #4 | 19 | 175 | 24.01 | 0.1380 | 119,127 | Chris Anderson, TED Talks, Laurence Vachon | TED-related intellectual cluster. Mix of educators, scientists, and thought leaders spreading ideas and inspirational content. Emphasis on innovation, science, and social impact. |
| #5 | 14 | 164 | 80.99 | 0.4969 | 11,396 | Nando de Freitas, Oriol Vinyals, Ilya Sutskever | Deep learning research core community: highly connected AI researchers and engineers. Focus on cutting-edge AI, neural networks, and machine learning theory. |
| #6 | 15 | 124 | 34.95 | 0.2842 | 10,899 | DeepMind, Fei-Fei Li, Kai Arulkumaran | Applied AI and robotics research cluster. Mix of academic and industry leaders working on AI ethics, computer vision, and intelligent systems. |
| #7 | 7 | 123 | 12.00 | 0.0984 | 107,831 | Gokula Krishnan, XKCD, Simone Giertz | STEM humor and maker culture community. Scientists, engineers, and creators sharing comics, DIY robotics, and academic humor. Strong creative-science crossover. |
| #8 | 4 | 112 | 35.07 | 0.3160 | 4,546 | Hugo Larochelle, Olivier Grisel, NeurIPS Conference | Machine learning conference and research-focused network. Participants include top ML researchers and conference organizers. Discusses papers, benchmarks, and open research. |
| #9 | 18 | 98 | 23.08 | 0.2380 | 80,606 | Guillaume Meurice, Olivier Bénis, Sale Con | French satire and cultural commentary cluster. Comedians, radio hosts, and satirical media accounts. Humor and critique of social/political issues. |
| #10 | 17 | 96 | 14.56 | 0.1533 | 81,344 | hardmaru, Alex J. Champandard, Eirini Malliaraki | AI art and creative tech cluster. Researchers and artists at the intersection of machine learning and creativity. Focus on generative art, AI aesthetics, and computational creativity. |
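
The notebook does not show how the summaries in the table were generated. As a rough sketch of one possible approach, the per-community statistics could be assembled into a text prompt and passed to whichever LLM is available. `build_cluster_prompt` is a hypothetical helper and no model API is called here.

```python
def build_cluster_prompt(comm_id, size, avg_degree, density, top_names):
    """Assemble a plain-text prompt describing one community (hypothetical helper)."""
    names = ", ".join(top_names)
    return (
        f"Community {comm_id} has {size} accounts, "
        f"average internal degree {avg_degree:.2f} and density {density:.4f}.\n"
        f"Highest-degree accounts: {names}.\n"
        "In two sentences, describe what these accounts have in common."
    )

# Values taken from the first row of the table above
prompt = build_cluster_prompt(
    22, 453, 37.65, 0.0833,
    ["Axelle Lemaire", "Maitre Eolas", "Nicolas Loubet"],
)
print(prompt)
```

Summaries produced this way are interpretations of a handful of account names, so they are best treated as labels to verify rather than findings.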


### Community Connectivity Matrix

Let's examine how communities are connected to each other.


In [None]:
# Create inter-community connectivity matrix
num_communities = len(partition)
connectivity_matrix = np.zeros((num_communities, num_communities))

for edge in G.edges():
    node1, node2 = edge
    comm1 = node_to_community[node1]
    comm2 = node_to_community[node2]
    
    if comm1 != comm2:
        connectivity_matrix[comm1, comm2] += 1
        connectivity_matrix[comm2, comm1] += 1  # Symmetric

# Focus on top communities for visualization
top_comm_ids = comm_stats_df.head(15)['community_id'].values
top_comm_indices = [int(cid) for cid in top_comm_ids]  # community IDs are already the matrix row/column indices
connectivity_submatrix = connectivity_matrix[np.ix_(top_comm_indices, top_comm_indices)]

# Create heatmap
fig = go.Figure(data=go.Heatmap(
    z=connectivity_submatrix,
    x=[f'Comm {cid}' for cid in top_comm_ids],
    y=[f'Comm {cid}' for cid in top_comm_ids],
    colorscale='YlOrRd',
    text=np.round(connectivity_submatrix, 0),
    texttemplate='%{text}',
    textfont={"size": 8},
    hovertemplate='Community %{y} ↔ Community %{x}<br>Edges: %{z:.0f}<extra></extra>'
))

fig.update_layout(
    title='Inter-Community Connectivity (Top 15 Communities)',
    xaxis_title='Community',
    yaxis_title='Community',
    height=700,
    width=800
)

fig.show()

print("💡 Insights:")
print("   - Diagonal elements are 0 because edges inside a community are excluded from the count")
print("   - Higher values indicate stronger connections between communities")
print("   - Well-separated communities would show low inter-community connectivity")
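
The counting logic behind the connectivity matrix is easier to follow on a graph small enough to check by hand. The sketch below uses a toy graph of three triangles with a hand-written partition; all names are introduced for the example only.

```python
import networkx as nx
import numpy as np

# Toy graph: three triangles. Triangles 0 and 1 share two cross edges,
# triangles 1 and 2 share one, and triangles 0 and 2 share none.
K = nx.Graph([(0, 1), (1, 2), (0, 2),
              (3, 4), (4, 5), (3, 5),
              (6, 7), (7, 8), (6, 8),
              (2, 3), (1, 4), (5, 6)])
toy_partition = [{0, 1, 2}, {3, 4, 5}, {6, 7, 8}]
toy_node_to_comm = {n: cid for cid, comm in enumerate(toy_partition) for n in comm}

M = np.zeros((3, 3), dtype=int)
for u, v in K.edges():
    cu, cv = toy_node_to_comm[u], toy_node_to_comm[v]
    if cu != cv:  # only edges that cross a community boundary are counted
        M[cu, cv] += 1
        M[cv, cu] += 1  # keep the matrix symmetric

print(M)
```

Each cross edge is counted once per direction, so the sum of the upper triangle equals the number of inter-community edges.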

## Part V: Key Insights and Conclusions

Let's summarize the key findings from our community detection analysis.


In [None]:
print(f"\n COMMUNITY QUALITY:")
avg_internal_density = comm_stats_df['internal_density'].mean()
avg_conductance = comm_stats_df['conductance'].mean()
print(f"   • Average internal density: {avg_internal_density:.4f}")
print(f"   • Average conductance: {avg_conductance:.4f}")
if avg_conductance < 0.3:
    print(f"     → Communities are well-separated (low conductance)")
else:
    print(f"     → Communities have significant inter-connections")

print(f"\n INTERPRETATION:")
if modularity_score > 0.3 and avg_conductance < 0.3:
    print(f"   ✓ The network exhibits strong community structure")
    print(f"   ✓ Communities are well-defined and relatively isolated")
    print(f"   ✓ This suggests distinct groups or clusters in the social network")
elif modularity_score > 0.2:
    print(f"   • The network shows moderate community structure")
    print(f"   • Some communities are well-defined, others are more interconnected")
else:
    print(f"   • The network has weak community structure")
    print(f"   • Connections are more distributed across the network")

---

## Conclusion

This analysis successfully applied the **Louvain algorithm** to detect communities in the friends social network. The results reveal:

- **Community Structure**: The network contains distinct communities with varying sizes and internal connectivity
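- **Modularity**: The modularity score measures how much denser within-community links are than expected at random; the interpretation cell above treats a score above 0.3, together with average conductance below 0.3, as strong community structure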
- **Network Insights**: Understanding community structure helps identify groups, influence patterns, and network dynamics

### Next Steps

Potential extensions of this analysis:
- Compare with other community detection algorithms (Girvan-Newman, Infomap, Leiden)
- Analyze temporal evolution of communities (if temporal data is available)
- Combine with centrality measures to identify community leaders
- Explore node attributes to understand what defines each community
- Apply community detection to subgraphs or filtered networks
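
As a starting point for the first extension, several community detection functions bundled with NetworkX can be compared by modularity on a small benchmark. The sketch below uses Zachary's karate club graph rather than the friends network, and covers only algorithms available in NetworkX itself; Infomap and Leiden typically rely on additional packages and are left out.

```python
import networkx as nx
from networkx.algorithms.community.quality import modularity

# Zachary's karate club, a small benchmark graph bundled with NetworkX
Z = nx.karate_club_graph()

candidates = {
    "louvain": nx.community.louvain_communities(Z, weight=None, seed=42),
    "greedy_modularity": nx.community.greedy_modularity_communities(Z),
    "label_propagation": list(nx.community.label_propagation_communities(Z)),
    # girvan_newman yields successively finer splits; take the first (two communities)
    "girvan_newman": next(nx.community.girvan_newman(Z)),
}

for name, parts in candidates.items():
    # weight=None scores every partition on the unweighted graph
    q = modularity(Z, parts, weight=None)
    print(f"{name:18s} {len(parts):2d} communities, modularity {q:.3f}")
```

Modularity is only one yardstick; methods that do not optimize it directly can still find meaningful groups, so the scores are best read alongside the community contents.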

### References

- **Louvain Algorithm**: Blondel, V. D., et al. (2008). "Fast unfolding of communities in large networks"
- **Modularity**: Newman, M. E. (2006). "Modularity and community structure in networks"
- **NetworkX Documentation**: https://networkx.org/
