# Lesson 2: Finding Bridge Papers with Betweenness

**Duration:** ~10 minutes 
**Module:** 5 - GDS with Python 
**Dataset:** Cora Citation Network (continued)

## What You'll Learn

- What "bridge papers" are and why they matter
- How Betweenness Centrality identifies information bottlenecks
- How to compare PageRank (influence) with Betweenness (connectivity)
- How to interpret centrality scores in context

## Prerequisites

- Completion of Lesson 1 (data loaded, PageRank computed)
- Graph `cora-graph` should exist in memory


## Quick Setup Check

Verify you're connected and the graph exists from Lesson 1.


In [None]:
import pandas as pd
from IPython.display import display
from graphdatascience import GraphDataScience

# UPDATE THESE WITH YOUR NEO4J CREDENTIALS (same as Lesson 1)
bolt = "bolt://localhost:7687"
user = "neo4j"
password = "your-password"
auth = (user, password)

# Reconnect to GDS
gds = GraphDataScience(bolt, auth=auth, aura_ds=False)

# Get the graph object from Lesson 1
G = gds.graph.get("cora-graph")

print(f"Connected to GDS version: {gds.version()}")
print(f"Graph '{G.name()}' loaded: {G.node_count():,} nodes, {G.relationship_count():,} relationships")


**Note:** If the graph doesn't exist, go back to Lesson 1 and run all cells first!
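
If you would rather fail with a clear message than a stack trace, a small guard can check the catalog first. This is a minimal sketch: `require_graph` is a hypothetical helper (not part of the lesson's code), and it assumes `gds.graph.exists(name)` returns a result whose `"exists"` field is truthy when the graph is in memory.

```python
def require_graph(gds, name):
    """Return the named graph object, or raise a clear error if it is missing.

    `gds` is expected to be a connected GraphDataScience client.
    """
    # Assumption: the catalog lookup exposes an "exists" field.
    if not gds.graph.exists(name)["exists"]:
        raise RuntimeError(f"Graph '{name}' not found; run all cells in Lesson 1 first.")
    return gds.graph.get(name)


# Usage in the notebook (requires a live connection):
# G = require_graph(gds, "cora-graph")
```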


## Part 1: What is Betweenness Centrality?

**The concept:**
- Measures how often a node appears on shortest paths between other nodes
- Identifies **information brokers** and **bottlenecks**
- Different from PageRank (influence) – Betweenness measures **connectivity**

**Citation network interpretation:**
- High Betweenness = **bridge paper** connecting different research areas
- These papers may not be the most cited
- But they're essential for knowledge transfer across fields

**Real-world analogy:**
- PageRank: The famous professor everyone knows
- Betweenness: The colleague who connects you to researchers in other departments
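
To make "appears on shortest paths" concrete, here is a small pure-Python sketch of Brandes' algorithm for unweighted graphs. It is an illustration on a made-up toy graph (two triangles joined through a single node `X`), not the implementation GDS uses. Each undirected edge is listed in both directions, so scores count ordered (source, target) pairs.

```python
from collections import deque


def betweenness(adj):
    """Brandes' algorithm for unweighted graphs; adj maps node -> list of neighbors."""
    scores = {v: 0.0 for v in adj}
    for s in adj:
        # Phase 1: BFS from s, counting shortest paths (sigma) and predecessors.
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:  # first time we reach w
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Phase 2: accumulate dependencies from the farthest nodes back toward s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                scores[w] += delta[w]
    return scores


# Toy graph (made up): triangle A-B-C, triangle D-E-F, joined only through X.
toy = {
    "A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B", "X"],
    "X": ["C", "D"],
    "D": ["X", "E", "F"], "E": ["D", "F"], "F": ["D", "E"],
}
toy_scores = betweenness(toy)
print(sorted(toy_scores.items(), key=lambda kv: kv[1], reverse=True))
```

In this toy graph `X` has only two neighbors yet scores highest, because every path between the two triangles must pass through it. That is the "bridge" behavior the lesson looks for in the citation network.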


### Business Value of Betweenness

**In fraud detection:**
- Find accounts that connect fraud rings (we saw this in Module 2)

**In supply chain:**
- Identify critical distribution hubs vulnerable to disruption

**In citation networks:**
- Discover interdisciplinary papers that bridge research areas


## Part 2: Computing Betweenness Centrality

Run Betweenness Centrality on the citation graph.


In [None]:
# Run Betweenness Centrality and write results
bc_result = gds.betweenness.write(
    G,
    writeProperty='betweenness'
)

print(f"Computed Betweenness for {bc_result['nodePropertiesWritten']:,} papers")
print(f" Min score: {bc_result['centralityDistribution']['min']:.2f}")
print(f" Max score: {bc_result['centralityDistribution']['max']:.2f}")
print(f" Mean score: {bc_result['centralityDistribution']['mean']:.2f}")


**What just happened?**
- We computed Betweenness for every paper in the network
- The algorithm provides distribution statistics automatically
- High scores indicate papers on many shortest paths between others
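
Raw scores are hard to compare across graphs of different sizes. One common way to read them in context is to divide by the largest score any node could have. The sketch below assumes the scores are unnormalized shortest-path counts; `normalize_betweenness` is a hypothetical helper, and whether to pass `directed=True` depends on how the graph was projected in Lesson 1.

```python
def normalize_betweenness(raw_score, node_count, directed=True):
    """Scale a raw betweenness score to the range [0, 1].

    The maximum is the number of (source, target) pairs that exclude the
    node itself: (n - 1)(n - 2) ordered pairs, or half that if undirected.
    """
    if node_count < 3:
        return 0.0  # no pair of *other* nodes exists
    max_pairs = (node_count - 1) * (node_count - 2)
    if not directed:
        max_pairs /= 2
    return raw_score / max_pairs


# Usage in the notebook (assumes raw scores and a directed projection):
# normalize_betweenness(bc_result['centralityDistribution']['max'], G.node_count())
```

A normalized value near 0 means the node sits on very few shortest paths relative to what is possible, even if the raw number looks large.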


## Part 3: Finding Bridge Papers

Query papers with highest Betweenness scores.


In [None]:
# Find top bridge papers
q_top_betweenness = """
MATCH (p:Paper)
WHERE p.betweenness IS NOT NULL
RETURN
    p.paper_Id AS paperId,
    p.subject AS subject,
    p.betweenness AS betweenness,
    p.pageRank AS pageRank
ORDER BY p.betweenness DESC
LIMIT 10
"""

df_bridges = gds.run_cypher(q_top_betweenness)
print("Top 10 Bridge Papers:")
display(df_bridges)


**Interpretation:**
- Notice how PageRank and Betweenness highlight different papers
- High Betweenness papers act as connectors between research communities
- These are often methodological papers that apply to multiple domains
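
One way to check how different the two rankings are is to compare their top-k sets. The sketch below uses made-up records and a hypothetical helper, `top_k_overlap`; the field names mirror the query's column aliases.

```python
def top_k_overlap(records, k, id_key="paperId", key_a="pageRank", key_b="betweenness"):
    """Return the ids that appear in the top k of both metrics."""
    def top_ids(metric):
        ranked = sorted(records, key=lambda r: r[metric], reverse=True)
        return {r[id_key] for r in ranked[:k]}

    return top_ids(key_a) & top_ids(key_b)


# Made-up scores for illustration only.
sample = [
    {"paperId": "p1", "pageRank": 0.9, "betweenness": 10},
    {"paperId": "p2", "pageRank": 0.8, "betweenness": 500},
    {"paperId": "p3", "pageRank": 0.2, "betweenness": 900},
    {"paperId": "p4", "pageRank": 0.1, "betweenness": 5},
]
shared = top_k_overlap(sample, k=2)
print(shared)
```

A small overlap suggests the two metrics are surfacing different papers. To run this on the real data you would need records for all papers, not just the ten rows in `df_bridges`, for example from a DataFrame via `df.to_dict("records")`.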


## Part 4: Comparing PageRank and Betweenness

Let's visualize the relationship between influence (PageRank) and connectivity (Betweenness).


In [None]:
import matplotlib.pyplot as plt

# Get all papers with both metrics
q_both_metrics = """
MATCH (p:Paper)
WHERE p.pageRank IS NOT NULL AND p.betweenness IS NOT NULL
RETURN p.pageRank AS pageRank,
       p.betweenness AS betweenness,
       p.subject AS subject
"""

df_metrics = gds.run_cypher(q_both_metrics)

# Create scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(df_metrics['pageRank'], df_metrics['betweenness'], alpha=0.5, s=10)
plt.xlabel('PageRank (Influence)', fontsize=12)
plt.ylabel('Betweenness Centrality (Connectivity)', fontsize=12)
plt.title('PageRank vs Betweenness in Citation Network', fontsize=14)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"Total papers plotted: {len(df_metrics):,}")


**What does this scatter plot tell us?**

- **Bottom-left cluster:** Most papers (low influence, low connectivity)
- **Top-right outliers:** "Superstar" papers (high influence AND high connectivity)
- **Top-left:** Bridge papers (high connectivity, low-to-moderate influence)
- **Bottom-right:** Influential but not connecting different areas

**Key insight:** Different algorithms reveal different types of importance!
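
The four regions described above can be expressed as a simple rule. This is a sketch: the labels and the `classify_paper` helper are hypothetical, and the cutoffs are values you would choose yourself from the plot.

```python
def classify_paper(page_rank, betweenness, pr_cut, bc_cut):
    """Assign a paper to one of the four scatter-plot regions."""
    high_pr = page_rank >= pr_cut
    high_bc = betweenness >= bc_cut
    if high_pr and high_bc:
        return "superstar"    # top-right
    if high_bc:
        return "bridge"       # top-left
    if high_pr:
        return "influential"  # bottom-right
    return "typical"          # bottom-left


# Example with made-up cutoffs:
print(classify_paper(page_rank=0.3, betweenness=4200, pr_cut=1.0, bc_cut=1000))
```

Applied row by row to `df_metrics`, a rule like this would let you count how many papers fall in each region instead of judging by eye.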


## Part 5: Cross-Subject Bridge Analysis

Which subjects produce the most bridge papers?


In [None]:
# Analyze bridge papers by subject.
# The 1000 cutoff is arbitrary; adjust it to the score distribution printed in Part 2.
q_subject_bridges = """
MATCH (p:Paper)
WHERE p.betweenness > 1000
RETURN
    p.subject AS subject,
    count(p) AS num_bridge_papers,
    avg(p.betweenness) AS avg_betweenness,
    avg(p.pageRank) AS avg_pageRank
ORDER BY num_bridge_papers DESC
"""

df_subject_bridges = gds.run_cypher(q_subject_bridges)
print("Bridge Papers by Research Subject:")
display(df_subject_bridges)
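
A fixed cutoff such as 1000 may return too many or too few papers, depending on how the scores are distributed in your projection. A data-driven alternative is to take a percentile of the observed scores. The sketch below uses the nearest-rank method; `percentile_cutoff` is a hypothetical helper and the scores are made up.

```python
import math


def percentile_cutoff(scores, pct):
    """Nearest-rank percentile: the smallest score with at least pct% of values at or below it."""
    ordered = sorted(scores)
    if not ordered:
        raise ValueError("scores must not be empty")
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up scores 1..100; papers strictly above the cutoff form the top 5%.
demo_scores = list(range(1, 101))
cutoff = percentile_cutoff(demo_scores, 95)
top_papers = [s for s in demo_scores if s > cutoff]
print(cutoff, len(top_papers))
```

In the notebook, the input could be the `betweenness` column of `df_metrics`, and the resulting value could stand in for the hard-coded threshold in the query.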


**Business Insight:**

Subjects with many high-betweenness papers are often **methodological fields** whose work is picked up across multiple domains. In business terms:
- These are technologies/methods that enable cross-functional collaboration
- Investing in these areas has broader impact across the organization
- Teams working in these areas should be central to knowledge sharing


## Part 6: Finding Specific Cross-Subject Bridges

Let's find papers that specifically bridge different research areas.


In [None]:
# Find papers that cite across subjects AND have high betweenness.
# The 500 cutoff is looser than in Part 5; tune both to your score distribution.
q_cross_subject = """
MATCH (p:Paper)-[:CITES]->(cited:Paper)
WHERE p.betweenness > 500
  AND p.subject <> cited.subject
WITH p,
     collect(DISTINCT cited.subject) AS cited_subjects,
     count(DISTINCT cited.subject) AS num_subjects_cited
WHERE num_subjects_cited >= 3
RETURN
    p.paper_Id AS paperId,
    p.subject AS subject,
    p.betweenness AS betweenness,
    p.pageRank AS pageRank,
    num_subjects_cited,
    cited_subjects
ORDER BY num_subjects_cited DESC, p.betweenness DESC
LIMIT 10
"""

df_cross_subject = gds.run_cypher(q_cross_subject)
print("Papers Bridging Multiple Research Areas:")
display(df_cross_subject)


**Why this matters:**

These papers are **truly interdisciplinary** – they:
1. Have high structural importance (Betweenness)
2. Actually cite work from 3+ different fields
3. Serve as knowledge transfer hubs

- **In fraud detection:** These would be accounts connecting multiple fraud rings
- **In supply chain:** These would be critical distribution centers
- **In research:** These are the papers that synthesize knowledge across fields
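
The cross-subject logic of the query can be mirrored in plain Python, which may help when checking results by hand. This is a sketch on made-up data: the paper ids, subject labels, and the `cross_subject_citations` helper are all hypothetical.

```python
from collections import defaultdict


def cross_subject_citations(subject_of, citations):
    """Map each citing paper to the set of *other* subjects it cites."""
    cited = defaultdict(set)
    for src, dst in citations:
        # Same filter as `p.subject <> cited.subject` in the query.
        if subject_of[src] != subject_of[dst]:
            cited[src].add(subject_of[dst])
    return dict(cited)


# Made-up papers and citations for illustration.
subject_of = {"p1": "Theory", "p2": "Vision", "p3": "Robotics", "p4": "NLP", "p5": "Theory"}
citations = [("p1", "p2"), ("p1", "p3"), ("p1", "p4"), ("p1", "p5"), ("p2", "p3")]

crossing = cross_subject_citations(subject_of, citations)
# Same rule as `num_subjects_cited >= 3` in the query.
multi_field = [paper for paper, subjects in crossing.items() if len(subjects) >= 3]
print(multi_field)
```

Note that the citation to a same-subject paper (`p1` to `p5`) does not count, so a paper needs three distinct *other* subjects to qualify. The betweenness filter from the query is left out here to keep the example small.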


## Summary: What You Accomplished

You've now worked with two fundamental centrality algorithms. In this lesson you:

- Computed Betweenness Centrality to find bridge papers
- Compared PageRank (influence) with Betweenness (connectivity)
- Visualized the relationship between different centrality metrics
- Identified interdisciplinary papers connecting multiple fields
- Used pandas DataFrames and matplotlib for analysis

**Key insights:**
- **PageRank** identifies influential nodes (papers cited by many papers, or by other highly ranked papers)
- **Betweenness** identifies connectors (papers bridging research areas)
- Combining multiple algorithms reveals richer insights than any single metric

### Next Lesson

In Lesson 3, you'll use **Louvain** to detect research communities and prepare features for machine learning with **scaling** and embeddings.
