# Lesson 1: Citation Networks & Setup

**Duration:** ~8 minutes 
**Module:** 5 - GDS with Python 
**Dataset:** Cora Citation Network

## What You'll Learn

- What citation networks reveal about scientific research
- How to install and connect the Python GDS client
- How to load real data programmatically
- How to create your first graph projection
- How to run PageRank to identify influential papers

## Prerequisites

- Neo4j instance running (Desktop, Sandbox, or Aura DS) with GDS available, plus APOC (the loading queries in Part 3 call `apoc.*` functions)
- Basic Python knowledge
- Completion of Modules 1-4 (understanding GDS concepts)


## Part 1: Installation & Setup

First, install the Python GDS client and required packages.


In [None]:
%%capture
%pip install graphdatascience pandas matplotlib python-dotenv


## What is a Citation Network?

Citation networks reveal:
- **Influential papers:** Which research shaped entire fields?
- **Bridge papers:** Which works connect different research areas?
- **Research communities:** Which papers form natural clusters?

**The Cora Dataset:**
- 2,708 academic papers
- 7 subject areas (Neural Networks, Reinforcement Learning, Theory, Genetic Algorithms, Case-Based Reasoning, Probabilistic Methods, Rule Learning)
- 10,556 citation relationships (Paper A cites Paper B)
- Each paper has content features (1,433-dimensional vector representing words used)

**Graph structure:**
- Nodes: `Paper` (with properties: `paper_Id`, `subject`, `features`, `subjectClass`)
- Relationships: `CITES` (directed: Paper A -> Paper B means A cites B)
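
To make the structure concrete, here is a minimal sketch of a directed citation graph in plain Python. The paper IDs and edges are made up for illustration and are not taken from Cora; the point is only that "cited by" (in-degree) and "cites" (out-degree) are different counts on the same edge list.

```python
from collections import Counter

# Hypothetical toy data: (citing_paper, cited_paper) pairs, not from Cora
cites = [(1, 3), (2, 3), (4, 3), (3, 5), (4, 5)]

# In-degree: how many times each paper is cited
times_cited = Counter(cited for _, cited in cites)

# Out-degree: how many references each paper makes
references_made = Counter(citing for citing, _ in cites)

print(times_cited.most_common(1))  # most-cited paper and its count
print(references_made[4])          # references made by paper 4
```

Raw citation counts like these are the baseline that PageRank improves on later in the lesson.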


## Part 2: Connection to Neo4j

**GitHub Codespaces users**: Your Neo4j instance is already running and `.env` is pre-configured!

**Local users**: Ensure you've updated the `.env` file in the repository root with your Neo4j connection details:
- For **Neo4j Sandbox**: Use the Bolt URL and credentials provided
- For **Aura DS**: Use your connection string
- For **Neo4j Desktop**: Usually `bolt://localhost:7687`

We'll load credentials from the `.env` file using `python-dotenv`.


In [None]:
import os
import pandas as pd
from IPython.display import display
from graphdatascience import GraphDataScience
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Get Neo4j credentials from environment variables
uri = os.getenv('NEO4J_URI')
username = os.getenv('NEO4J_USERNAME')
password = os.getenv('NEO4J_PASSWORD')

# Create GDS client
# Auto-detect Aura DS (URIs containing 'neo4j+s://')
aura_ds = 'neo4j+s://' in uri if uri else False
gds = GraphDataScience(uri, auth=(username, password), aura_ds=aura_ds)

# Verify connection and check GDS version
print(f"Connected to GDS version: {gds.version()}")
print(f"Connected to: {uri}")


**Success!** If you see a version number above, you're connected to Neo4j and ready to proceed.

**What just happened?**
- `load_dotenv()` loaded environment variables from the `.env` file
- `os.getenv()` retrieved your Neo4j connection details
- We created a `GraphDataScience` object that connects to your Neo4j database
- This object (`gds`) is your interface to all GDS functionality
- You can run Cypher queries, create projections, and execute algorithms all through this object

**Why load from .env?**
- Keeps credentials out of the notebook itself (and never commit `.env` to git)
- Same credentials across all notebooks (no retyping!)
- Easy to switch between environments (local, Codespaces, Sandbox)
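
If a variable is missing from `.env`, `os.getenv()` quietly returns `None` and the failure only shows up later, at connection time. The sketch below is one possible way to fail early with a readable message. The helper name `neo4j_settings` and the placeholder values are hypothetical; only the three variable names come from the cell above.

```python
import os

def neo4j_settings(env=os.environ):
    """Return (uri, auth, aura_ds), or raise if a required variable is missing."""
    required = ("NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing in .env: {', '.join(missing)}")
    uri = env["NEO4J_URI"]
    # Same heuristic as the connection cell: 'neo4j+s://' is taken to mean Aura DS
    aura_ds = "neo4j+s://" in uri
    return uri, (env["NEO4J_USERNAME"], env["NEO4J_PASSWORD"]), aura_ds

# Example with a hypothetical local configuration (placeholder values)
example_env = {
    "NEO4J_URI": "bolt://localhost:7687",
    "NEO4J_USERNAME": "neo4j",
    "NEO4J_PASSWORD": "change-me",
}
demo_uri, demo_auth, demo_aura_ds = neo4j_settings(example_env)
print(demo_uri, demo_aura_ds)
```

Passing the environment in as a plain mapping keeps the helper easy to try out without touching your real credentials.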


## Part 3: Loading the Citation Network

Now load the Cora citation dataset from GitHub. We'll use `gds.run_cypher()` to execute Cypher queries from Python.


In [None]:
# Load paper nodes (2,708 papers with subjects and features)
node_load_q = """
LOAD CSV WITH HEADERS FROM 
 "https://raw.githubusercontent.com/Kristof-Neys/Neo4j-Cora/main/node_list.csv" AS row
WITH toInteger(row.id) AS paperId, 
 row.subject AS subject, 
 row.features AS features
MERGE (p:Paper {paper_Id: paperId})
SET p.subject = subject, 
 p.features = apoc.convert.fromJsonList(features)
RETURN count(p) AS papers_loaded
"""

result = gds.run_cypher(node_load_q)
print(f"Loaded {result['papers_loaded'][0]} papers")


In [None]:
# Load citation relationships (10,556 citations)
edge_load_q = """
LOAD CSV WITH HEADERS FROM 
 "https://raw.githubusercontent.com/Kristof-Neys/Neo4j-Cora/main/edge_list.csv" AS row
MATCH (source:Paper {paper_Id: toInteger(row.source)})
MATCH (target:Paper {paper_Id: toInteger(row.target)})
MERGE (source)-[r:CITES]->(target)
RETURN count(r) AS citations_loaded
"""

result = gds.run_cypher(edge_load_q)
print(f"Loaded {result['citations_loaded'][0]} citations")


**What just happened?**
- We used `gds.run_cypher()` to execute Cypher queries from Python
- The queries loaded CSV data from GitHub into Neo4j
- Results came back as pandas DataFrames (notice the `['papers_loaded'][0]` syntax)

**Key insight:** `gds.run_cypher()` lets you run ANY Cypher query and get pandas DataFrames back!
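
`LOAD CSV` hands every field to Cypher as a string, which is why the node query wraps `row.id` in `toInteger()` and parses `row.features` with `apoc.convert.fromJsonList()`. The sketch below mirrors those two conversions in plain Python. The two-row CSV is a made-up sample with 3-element feature lists; only the column names (`id`, `subject`, `features`) come from the query above.

```python
import csv
import io
import json

# Hypothetical two-row sample; the real node_list.csv has 2,708 rows
sample_csv = (
    "id,subject,features\n"
    '0,Theory,"[0, 1, 0]"\n'
    '1,Neural Networks,"[1, 0, 0]"\n'
)

papers = []
for row in csv.DictReader(io.StringIO(sample_csv)):
    papers.append({
        "paper_Id": int(row["id"]),               # like toInteger(row.id)
        "subject": row["subject"],                # stays a string
        "features": json.loads(row["features"]),  # like apoc.convert.fromJsonList
    })

print(papers[0])
```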


In [None]:
# Quick peek at the data
q_peek = """
MATCH (n:Paper) 
WHERE n.features IS NOT NULL 
RETURN n.paper_Id AS PaperId, 
 n.subject AS Paper_Subject, 
 size(n.features) AS num_features
LIMIT 5
"""

df = gds.run_cypher(q_peek)
print("Sample papers from dataset:")
display(df)


In [None]:
# Convert subject labels to integers (graph projections accept only numeric node properties)
query_subj = """
MATCH (p:Paper)
WITH collect(DISTINCT p.subject) AS listSubjects
WITH listSubjects, size(listSubjects) AS sizeListSubjects
WITH listSubjects, range(1, sizeListSubjects) AS rangeSubjects
WITH apoc.map.fromLists(listSubjects, rangeSubjects) AS mapSubjects
MATCH (p:Paper)
SET p.subjectClass = mapSubjects[p.subject]
RETURN count(p) AS papers_updated
"""

result = gds.run_cypher(query_subj)
print(f"Converted subjects to numeric for {result['papers_updated'][0]} papers")
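
The query above is a label encoding: it collects the distinct subjects, pairs them with the integers 1 to n, and writes the matching integer onto each paper. A minimal plain-Python sketch of the same idea follows. The five labels are hypothetical sample data, and the order in which Cypher collects distinct subjects may differ from the first-seen order used here.

```python
# Hypothetical subject labels for five papers (not read from the database)
subjects = ["Theory", "Neural Networks", "Theory", "Rule Learning", "Neural Networks"]

# Like collect(DISTINCT p.subject): unique labels, here in first-seen order
distinct = list(dict.fromkeys(subjects))

# Cypher's range(1, n) includes both ends, so classes run from 1 to n
subject_to_class = {label: i for i, label in enumerate(distinct, start=1)}

# Like SET p.subjectClass = mapSubjects[p.subject]
subject_classes = [subject_to_class[s] for s in subjects]

print(subject_to_class)
print(subject_classes)
```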


## Part 4: Your First Graph Projection

Before running algorithms, we create an **in-memory graph projection**.

**Why project?**
- Algorithms run on an optimized in-memory structure (much faster than working against the stored graph)
- Can include only relevant nodes/relationships
- Can configure orientation (directed vs undirected)
- Can include properties needed by algorithms


In [None]:
# Create a graph projection
G, result = gds.graph.project(
 "cora-graph", # Graph name
 {"Paper": { # Node projection
 "properties": ["subjectClass", "features"] # Include properties
 }},
 {"CITES": { # Relationship projection
 "orientation": "UNDIRECTED", # Treat citations as bidirectional
 "aggregation": "SINGLE" # Keep only one edge per pair
 }}
)

# Inspect the projected graph
print(f"Projected graph: {G.name()}")
print(f" Nodes: {G.node_count():,}")
print(f" Relationships: {G.relationship_count():,}")
print(f" Memory usage: {G.memory_usage()}")
print(f" Density: {G.density():.6f}")
print(f" Node properties: {G.node_properties('Paper')}")


**Why UNDIRECTED?**

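Citations are directional (Paper A cites Paper B), but this projection treats each one as a connection in both directions. That captures the network of **related research** rather than only the flow of influence. One consequence to keep in mind: on an undirected graph, PageRank rewards papers that are well connected in either direction, whether they cite or are cited.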
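
As a rough illustration of what `"orientation": "UNDIRECTED"` and `"aggregation": "SINGLE"` do together, the sketch below applies both ideas to a tiny hypothetical edge list. It is a simplified model in plain Python, not GDS's actual implementation.

```python
# Hypothetical directed citations, including a duplicate and a mutual pair
directed = [("A", "B"), ("A", "B"), ("B", "A"), ("B", "C")]

# UNDIRECTED: every relationship can also be traversed in reverse
both_ways = directed + [(target, source) for source, target in directed]

# aggregation SINGLE: collapse parallel relationships between the same pair
projected = sorted(set(both_ways))

print(projected)
print(len(projected))
```

Four stored citations become two undirected connections, each represented once per direction. If the relationship count reported for your projection looks different from the number of stored `CITES` relationships, this kind of mirroring and de-duplication is a likely reason.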

**The Graph object:**
- `G` is a Python object representing your projected graph
- Has methods like `.node_count()`, `.memory_usage()`, etc.
- Much cleaner than querying the Graph Catalog with Cypher
- Can be passed directly to algorithm functions


### Projection Syntax Comparison

**In Browser (Cypher):**
```cypher
MATCH (source:Paper)-[r:CITES]->(target:Paper)
WITH gds.graph.project(
 'cora-graph',
 source,
 target,
 {relationshipType: type(r)},
 {undirectedRelationshipTypes: ['CITES']}
) AS g
RETURN g.graphName, g.nodeCount
```

**In Python:**
```python
G, result = gds.graph.project(
 'cora-graph',
 {'Paper': {'properties': ['subjectClass', 'features']}},
 {'CITES': {'orientation': 'UNDIRECTED'}}
)
```

**Much simpler!**


## Part 5: Finding Influential Papers with PageRank

Now run PageRank to identify the most influential papers in the citation network.

**What does PageRank measure?**
- Not just citation count
- Papers cited by OTHER influential papers score higher
- Reveals foundational papers that shaped fields
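
To see why PageRank is more than a citation count, here is a simplified sketch of the classic update rule, score = (1 - d) + d × (sum of each citing paper's score divided by its number of references). The function name `toy_pagerank` and the edge list are hypothetical, and the sketch uses directed edges to show the textbook intuition, whereas the projection in this lesson is undirected. GDS's own implementation differs in details such as convergence checks.

```python
def toy_pagerank(edges, damping=0.85, iterations=20):
    """Simplified PageRank over (citing, cited) pairs; not the GDS implementation."""
    nodes = sorted({node for edge in edges for node in edge})
    out_degree = {node: 0 for node in nodes}
    for source, _ in edges:
        out_degree[source] += 1

    scores = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Each paper shares its current score equally among the papers it cites
        incoming = {node: 0.0 for node in nodes}
        for source, target in edges:
            incoming[target] += scores[source] / out_degree[source]
        scores = {node: (1 - damping) + damping * incoming[node] for node in nodes}
    return scores

# Hypothetical citations: x is cited twice, y only once,
# but y's single citation comes from the well-cited paper "hub"
toy_edges = [("a", "hub"), ("b", "hub"), ("c", "hub"),
             ("hub", "y"), ("d", "x"), ("e", "x")]
toy_scores = toy_pagerank(toy_edges)

for paper in ("hub", "x", "y"):
    print(paper, round(toy_scores[paper], 4))
```

In this toy graph, `y` ends up ahead of `x` despite having fewer citations, because its one citation comes from a paper that is itself highly ranked.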


In [None]:
# Run PageRank and write results back to the database
PR_result = gds.pageRank.write(
 G, # The graph object
 writeProperty='pageRank', # Property name to store results
 maxIterations=20, # Upper bound on iterations (may stop earlier on convergence)
 dampingFactor=0.85 # Standard damping factor
)

# Print algorithm results
print(f"Computed PageRank for {PR_result['nodePropertiesWritten']:,} papers")
print(f" Iterations ran: {PR_result['ranIterations']}")


**What just happened?**
- We ran PageRank on the projected graph `G`
- Results were written back to Neo4j as a property called `pageRank`
- The returned result reports how many node properties were written and how many iterations ran

**Python vs Browser:**
- In Browser: `CALL gds.pageRank.write('graph-name', {config})`
- In Python: `gds.pageRank.write(G, **config)`
- Same algorithm, simpler syntax!


In [None]:
# Find top influential papers and what they cite
influential_papers = gds.run_cypher("""
MATCH (p:Paper)
WHERE p.pageRank > 5
MATCH (p)-[:CITES]->(cited:Paper)
WHERE p.subject <> cited.subject
RETURN 
 p.subject AS source_subject,
 p.pageRank AS source_pageRank,
 cited.subject AS cited_subject,
 cited.pageRank AS cited_pageRank
ORDER BY p.pageRank DESC 
LIMIT 10
""")

print("Top Influential Papers and Their Cross-Subject Citations:")
display(influential_papers)


## Interpretation

**What do these results mean?**

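In this network, papers with PageRank > 5 stand out as highly influential. The threshold is specific to this dataset, since PageRank scores are only meaningful relative to each other. If a Neural Networks paper with PageRank 13.84 cites Reinforcement Learning papers, it is likely **foundational interdisciplinary work**.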

**Business value:**
"If you're entering a research field, PageRank tells you which papers you MUST read: not just popular papers, but truly foundational ones."

**Remember from Module 2:**
We used PageRank to find influential fraudsters. Same algorithm, different domain!


## Python Client vs Neo4j Browser

| Aspect | Browser/Cypher | Python Client |
|--------|----------------|---------------|
| **Data Loading** | Run queries in Browser console | `gds.run_cypher()` returns DataFrames |
| **Projection** | MATCH + WITH + gds.graph.project | `gds.graph.project(name, nodes, rels)` |
| **Algorithm** | CALL gds.pageRank.write(...) | `gds.pageRank.write(G, ...)` |
| **Results** | Cypher result table | pandas DataFrame/Series |
| **Graph Inspection** | Query Graph Catalog | `G.node_count()`, `G.memory_usage()` |
| **Workflow** | Interactive, visual | Programmatic, scriptable |
| **Best for** | Exploration, visualization | Production, ML pipelines |


## Summary: What You Accomplished

You've built a complete citation network analysis pipeline in Python:

- Connected to Neo4j with the Python GDS client
- Loaded 2,708 papers and 10,556 citations programmatically
- Created a graph projection with properties
- Ran PageRank to identify the most influential papers
- Queried results as pandas DataFrames

**Key insight:** Python client = same GDS power, simpler syntax, seamless Python ecosystem integration

### Next Lesson

In Lesson 2, you'll combine PageRank with **Betweenness Centrality** to find "bridge papers" that connect different research areas.
