# Lesson 3: Research Communities & Feature Engineering

**Duration:** ~12 minutes 
**Module:** 5 - GDS with Python 
**Dataset:** Cora Citation Network (continued)

## What You'll Learn

- How to detect research communities with Louvain
- Why community detection matters for understanding networks
- How to scale features for machine learning
- How to prepare node properties for embeddings

## Prerequisites

- Completion of Lessons 1-2 (PageRank and Betweenness computed)
- Graph `cora-graph` should exist in memory


## Quick Setup Check


In [None]:
import pandas as pd
from IPython.display import display
from graphdatascience import GraphDataScience
import matplotlib.pyplot as plt

# UPDATE THESE WITH YOUR NEO4J CREDENTIALS
bolt = "bolt://localhost:7687"
user = "neo4j"
password = "your-password"
auth = (user, password)

# Reconnect to GDS
gds = GraphDataScience(bolt, auth=auth, aura_ds=False)

# Get the graph object
G = gds.graph.get("cora-graph")

print(f"Connected to GDS version: {gds.version()}")
print(f"Graph '{G.name()}' ready with {G.node_count():,} nodes")


## Part 1: What is Community Detection?

**The concept:**
- Identifies groups of nodes more densely connected to each other than to the rest of the network
- Reveals natural clusters and organizational structure
- Works without knowing subjects/labels in advance

**Louvain algorithm:**
- Optimizes "modularity", a measure of how much denser the links inside communities are than chance would predict
- Fast and scalable (works on millions of nodes)
- Hierarchical – can detect communities at different scales
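
To make "modularity" concrete, here is a minimal sketch that computes it by hand for a fixed partition. It assumes the standard definition for an undirected, unweighted graph; the toy graph, the `modularity` helper, and the community labels are hypothetical illustrations, not part of GDS or the Cora dataset. Louvain searches for the partition that maximizes this score; the sketch only evaluates two given partitions.

```python
from collections import defaultdict

def modularity(edges, community_of):
    """Modularity Q of an undirected, unweighted graph for a given partition."""
    m = len(edges)                  # total number of edges
    internal = defaultdict(int)     # edges with both endpoints in the community
    degree_sum = defaultdict(int)   # sum of node degrees per community
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    # Q = sum over communities of (internal edge share - expected share)
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)

# Hypothetical toy graph: two triangles joined by a single edge
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
one_group = {n: "A" for n in range(6)}

q_two = modularity(edges, two_groups)
q_one = modularity(edges, one_group)
print(f"Two triangles as separate communities: Q = {q_two:.4f}")
print(f"Everything in one community:           Q = {q_one:.4f}")
```

Splitting the graph along the single bridging edge scores higher than lumping all nodes together, which is the kind of improvement Louvain looks for at each step.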

**Citation network interpretation:**
- Communities = research sub-fields or methodological schools
- Papers in same community tend to cite each other
- Can reveal emerging research areas not captured by formal subject labels


### Business Value of Community Detection

**In fraud detection (Module 2):**
- Detect organized fraud rings
- Find coordinated attacks

**In social networks:**
- Identify interest groups
- Target marketing campaigns

**In citation networks:**
- Discover research communities
- Identify cross-pollination opportunities


## Part 2: Running Louvain Community Detection


In [None]:
# Run Louvain community detection
louvain_result = gds.louvain.write(
 G,
 writeProperty='louvainCommunity',
 maxLevels=10,
 maxIterations=10
)

print(f"Detected {louvain_result['communityCount']} communities")
print(f" Modularity score: {louvain_result['modularity']:.4f}")
print(f" Levels computed: {louvain_result['ranLevels']}")
print(" Community size distribution:")
print(f" Min: {louvain_result['communityDistribution']['min']}")
print(f" Max: {louvain_result['communityDistribution']['max']}")
print(f" Mean: {louvain_result['communityDistribution']['mean']:.1f}")


**What just happened?**
- Louvain found natural communities in the citation network
- **Modularity** measures how well-defined the communities are; it ranges from -0.5 to 1, and higher values mean more of the citations stay inside communities
- The algorithm found communities of varying sizes

**Note:** The communities found may not match the official subject labels – they're based purely on citation patterns!


## Part 3: Analyzing Detected Communities

How do detected communities align with official subjects?


In [None]:
# Compare communities with subjects
q_community_subjects = """
MATCH (p:Paper)
WHERE p.louvainCommunity IS NOT NULL
WITH p.louvainCommunity AS community, 
 p.subject AS subject,
 count(*) AS count
RETURN community, subject, count
ORDER BY community, count DESC
"""

df_comm_subj = gds.run_cypher(q_community_subjects)
print("Community composition by subject:")
display(df_comm_subj.head(20))


In [None]:
# Get summary statistics for each community
q_community_stats = """
MATCH (p:Paper)
WHERE p.louvainCommunity IS NOT NULL
WITH p.louvainCommunity AS community,
 collect(DISTINCT p.subject) AS subjects,
 count(*) AS size,
 avg(p.pageRank) AS avg_pageRank,
 avg(p.betweenness) AS avg_betweenness
RETURN community, 
 size,
 size(subjects) AS num_subjects,
 subjects,
 avg_pageRank,
 avg_betweenness
ORDER BY size DESC
LIMIT 10
"""

df_community_stats = gds.run_cypher(q_community_stats)
print("\nTop 10 Communities by Size:")
display(df_community_stats)


**Interpretation:**
- **Pure communities:** Dominated by one subject (tightly focused research area)
- **Mixed communities:** Span multiple subjects (interdisciplinary research)
- **High avg_pageRank:** Influential communities
- **High avg_betweenness:** Communities that bridge different areas

Communities reveal organizational structure beyond formal labels!
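
One way to quantify "pure" versus "mixed" is the share of papers in a community that belong to its most common subject. The sketch below assumes rows shaped like the `(community, subject, count)` output of the first query in this part; the `community_purity` helper and the sample rows are hypothetical.

```python
from collections import defaultdict

def community_purity(rows):
    """Return {community: (dominant_subject, purity)} from (community, subject, count) rows."""
    by_community = defaultdict(dict)
    for community, subject, count in rows:
        counts = by_community[community]
        counts[subject] = counts.get(subject, 0) + count
    result = {}
    for community, counts in by_community.items():
        dominant = max(counts, key=counts.get)           # most common subject
        result[community] = (dominant, counts[dominant] / sum(counts.values()))
    return result

# Hypothetical rows, not real Cora results
rows = [
    (12, "Neural_Networks", 90),
    (12, "Theory", 10),
    (47, "Genetic_Algorithms", 30),
    (47, "Reinforcement_Learning", 30),
    (47, "Theory", 40),
]
purity = community_purity(rows)
for community, (subject, share) in purity.items():
    print(f"Community {community}: {subject} ({share:.0%} of papers)")
```

With the notebook's DataFrame, `df_comm_subj.itertuples(index=False, name=None)` should yield tuples in this shape, assuming the columns are still `community`, `subject`, `count` in that order.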


## Part 4: Feature Engineering for Machine Learning

Now prepare features for embeddings and prediction. We'll scale numeric properties to similar ranges.


### Why Scale Features?

**The problem:**
- `features` array: values of 0 or 1 (word-presence indicators)
- `betweenness`: values from 0 to thousands
- `pageRank`: values from 0 to ~20

**Without scaling:**
- Algorithms dominated by high-magnitude features
- `betweenness` would overwhelm `features`

**With scaling:**
- All features have mean=0, std=1
- Equal contribution to embeddings and predictions
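
The "mean=0, std=1" target corresponds to z-score standardization. A minimal sketch using only the standard library follows; the `standardize` helper and the sample betweenness values are hypothetical, and the population standard deviation is an assumption (some scalers use the sample version).

```python
from statistics import fmean, pstdev

def standardize(values):
    """Z-score a list of numbers: subtract the mean, divide by the population std."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        # A constant column carries no information; map it to zeros
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

# Hypothetical betweenness values for five papers
betweenness = [0.0, 12.5, 340.0, 2100.0, 9800.0]
scaled = standardize(betweenness)

print("scaled:", [round(v, 3) for v in scaled])
print(f"mean = {fmean(scaled):.6f}, std = {pstdev(scaled):.6f}")
```

The ordering of the papers is unchanged; only the scale is, so the column can sit next to 0/1 word features without swamping them.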


In [None]:
# Stream the features array from the in-memory graph to confirm it is available
# (assumes 'features' was included when 'cora-graph' was projected)
df_features = gds.graph.nodeProperties.stream(
 G,
 node_properties=['features'],
 separate_property_columns=True
)

print(f"Features property available for {len(df_features):,} nodes")


In [None]:
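# Combine content features with centrality scores into one vector per paper
# Note: this simplified query concatenates the raw values; it does not standardize them
# The result is written to a new property 'scaledFeatures'

scale_query = """
MATCH (p:Paper)
WHERE p.features IS NOT NULL 
 AND p.betweenness IS NOT NULL 
 AND p.pageRank IS NOT NULL
WITH p, 
 [x IN p.features | toFloat(x)] AS feat,
 [toFloat(p.betweenness), toFloat(p.pageRank)] AS metrics
// Simplified approach: append the two centrality scores to the feature array
// (toFloat keeps every element the same type so the list can be stored)
SET p.scaledFeatures = feat + metrics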
RETURN count(p) AS papers_prepared
"""

result = gds.run_cypher(scale_query)
print(f"Prepared {result['papers_prepared'][0]} papers with combined features")


**What just happened?**
- We combined three types of features:
 1. **Content features** (1,433-dim word vectors)
 2. **Betweenness** (connectivity importance)
 3. **PageRank** (citation influence)
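- Created a unified feature vector for each paper, stored as `scaledFeatures`
- The query above concatenates the raw values; standardize `betweenness` and `pageRank` (mean=0, std=1) first if the downstream algorithm is sensitive to magnitude

**Why this matters:**
Many machine learning models, including embedding algorithms, tend to perform better when their inputs are on comparable scales.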
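
The effect of scale on a combined vector can be seen with a small distance comparison. This is a sketch under stated assumptions: the three papers and their values are hypothetical, only the two centrality columns are standardized, and the binary word features are left as they are.

```python
from math import dist
from statistics import fmean, pstdev

def zscore_column(column):
    """Standardize one numeric column across all papers."""
    mean, std = fmean(column), pstdev(column)
    return [(v - mean) / std if std else 0.0 for v in column]

# Hypothetical papers: (binary word features, betweenness, pageRank)
papers = {
    "a": ([1, 0, 1, 0], 5.0, 0.4),
    "b": ([1, 0, 1, 0], 900.0, 0.5),  # same words as "a", very different betweenness
    "c": ([0, 1, 0, 1], 6.0, 0.4),    # different words, betweenness close to "a"
}
ids = list(papers)
bet = zscore_column([papers[i][1] for i in ids])
pr = zscore_column([papers[i][2] for i in ids])

# Raw concatenation versus concatenation with standardized centrality columns
raw = {i: [float(x) for x in papers[i][0]] + [papers[i][1], papers[i][2]] for i in ids}
scaled = {i: [float(x) for x in papers[i][0]] + [bet[k], pr[k]] for k, i in enumerate(ids)}

raw_ab, raw_ac = dist(raw["a"], raw["b"]), dist(raw["a"], raw["c"])
scaled_ab, scaled_ac = dist(scaled["a"], scaled["b"]), dist(scaled["a"], scaled["c"])
print(f"raw:    d(a,b) = {raw_ab:.2f}, d(a,c) = {raw_ac:.2f}")
print(f"scaled: d(a,b) = {scaled_ab:.2f}, d(a,c) = {scaled_ac:.2f}")
```

In the raw vectors the betweenness gap alone decides the distance between "a" and "b"; after standardization the word features and the centrality scores contribute on a comparable scale.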


## Part 5: Visualizing Communities

Let's visualize the community structure.


In [None]:
# Get community sizes
q_community_sizes = """
MATCH (p:Paper)
WHERE p.louvainCommunity IS NOT NULL
RETURN p.louvainCommunity AS community, 
 count(*) AS size
ORDER BY size DESC
LIMIT 15
"""

df_sizes = gds.run_cypher(q_community_sizes)

# Create bar chart
plt.figure(figsize=(12, 6))
plt.bar(df_sizes['community'].astype(str), df_sizes['size'])
plt.xlabel('Community ID', fontsize=12)
plt.ylabel('Number of Papers', fontsize=12)
plt.title('Top 15 Research Communities by Size', fontsize=14)
plt.xticks(rotation=45)
plt.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

print(f"Communities with 10+ papers among the top 15: {len(df_sizes[df_sizes['size'] >= 10])}")


## Summary: What You Accomplished

You've now added community detection and feature engineering to your GDS toolkit:

- Ran Louvain to detect natural research communities
- Analyzed community composition and characteristics
- Compared algorithmic communities with official subject labels
- Prepared a combined feature vector and saw why scaling matters for machine learning
- Combined graph structure (centrality) with node attributes (content)

**Key insights:**
- **Community detection** reveals organizational structure from connectivity patterns alone
- **Feature scaling** is essential for combining different types of data
- **Graph + content features** create richer representations than either alone

### Next Lesson

In Lesson 4, you'll use **FastRP** to create node embeddings, compute **Node Similarity** for recommendations, and explore **production patterns** for deploying GDS in real applications.
