# Lesson 4: Similarity & Prediction

**Duration:** ~15 minutes 
**Module:** 5 - GDS with Python 
**Dataset:** Cora Citation Network (continued)

## What You'll Learn

- How to create node embeddings with FastRP
- How to compute similarity between papers
- How to make link predictions (recommend citations)
- Production patterns for GDS workflows
- How to clean up graph projections

## Prerequisites

- Completion of Lessons 1-3 (all centrality, communities, and features computed)
- Graph `cora-graph` should exist in memory


## Quick Setup Check


In [None]:
import os
import pandas as pd
import numpy as np
from IPython.display import display
from graphdatascience import GraphDataScience
from dotenv import load_dotenv
import matplotlib.pyplot as plt

# Load credentials from .env
load_dotenv()
uri = os.getenv('NEO4J_URI')
username = os.getenv('NEO4J_USERNAME')
password = os.getenv('NEO4J_PASSWORD')

# Connect to GDS
aura_ds = 'neo4j+s://' in uri if uri else False
gds = GraphDataScience(uri, auth=(username, password), aura_ds=aura_ds)

# Get the graph object
G = gds.graph.get("cora-graph")

print(f"Connected to GDS version: {gds.version()}")
print(f"Graph '{G.name()}' ready: {G.node_count():,} nodes")
print(f"Node properties: {G.node_properties('Paper')}")


## Part 1: What are Node Embeddings?

**The concept:**
- Transform nodes into dense vector representations (embeddings)
- Similar nodes -> similar vectors
- Enables machine learning on graphs (clustering, classification, link prediction)

**FastRP (Fast Random Projection):**
- Creates embeddings from graph structure + node properties
- Much faster than deep learning approaches (e.g., GraphSAGE)
- Works well on large graphs (millions of nodes)
- Captures both local and global graph structure

**What influences the embedding?**
1. **Graph structure:** Papers citing similar papers get similar embeddings
2. **Node features:** Papers with similar content get similar embeddings
3. **Centrality:** We included PageRank & Betweenness in our features!
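
To make the idea concrete, here is a toy NumPy sketch of random projection plus neighbor averaging, the intuition behind FastRP. It is not the GDS implementation: it skips degree normalization, sparse projections, and node features. The function name `toy_fastrp` and the five-node graph are made up for illustration.

```python
import numpy as np

def toy_fastrp(adjacency, dim=8, iteration_weights=(0.0, 1.0, 1.0), seed=42):
    """Toy random-projection embedding (illustrative only, not the GDS algorithm)."""
    rng = np.random.default_rng(seed)
    n = adjacency.shape[0]
    # Row-normalize so each propagation step averages over a node's neighbors
    degree = adjacency.sum(axis=1, keepdims=True)
    transition = adjacency / np.maximum(degree, 1)
    # Start from one random vector per node
    current = rng.standard_normal((n, dim))
    embedding = np.zeros((n, dim))
    for weight in iteration_weights:
        # Each iteration pulls in information from one hop further away
        current = transition @ current
        # L2-normalize rows so the iterations are on a comparable scale
        norms = np.linalg.norm(current, axis=1, keepdims=True)
        current = current / np.maximum(norms, 1e-12)
        embedding += weight * current
    return embedding

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical graph: nodes 0 and 1 have exactly the same neighbors (2 and 3)
A = np.zeros((5, 5))
for i, j in [(0, 2), (0, 3), (1, 2), (1, 3), (3, 4)]:
    A[i, j] = A[j, i] = 1.0

emb = toy_fastrp(A)
print(f"cosine(0, 1): {cosine(emb[0], emb[1]):.3f}")  # identical neighborhoods
print(f"cosine(0, 4): {cosine(emb[0], emb[4]):.3f}")  # different neighborhoods
```

Because nodes 0 and 1 share the same neighbor set, the sketch gives them the same vector, which mirrors the "papers citing similar papers get similar embeddings" point above.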


### Business Value of Embeddings

**In e-commerce:**
- Product recommendations based on user behavior graph

**In fraud detection:**
- Classify suspicious accounts based on transaction patterns

**In citation networks:**
- Recommend papers for researchers to read
- Predict future citations (link prediction)
- Cluster papers into research topics


## Part 2: Creating Embeddings with FastRP

Run FastRP using the scaled features from Lesson 3.


In [None]:
# Run FastRP to create 128-dimensional embeddings
# Use MUTATE mode to keep embeddings in memory (not written to database yet)

embedding_dimension = 128
feature_properties = ['scaledFeatures']  # Prepared in Lesson 3

fastrp_result = gds.fastRP.mutate(
    G,
    mutateProperty='embedding',
    embeddingDimension=embedding_dimension,
    featureProperties=feature_properties,
    propertyRatio=0.5,  # Share of dimensions built from features; the default 0.0 ignores them (0.5 is an assumed starting point)
    randomSeed=42,  # For reproducibility
    iterationWeights=[0.0, 1.0, 1.0]  # Skip iteration 1; weight iterations 2 and 3 equally
)

print(f"Created {fastrp_result['nodePropertiesWritten']:,} embeddings")
print(f" Embedding dimension: {embedding_dimension}")
print(f" Based on {len(feature_properties)} feature properties")


**What just happened?**
- FastRP created a 128-dimensional vector for each paper
- Used **MUTATE** mode (keeps embeddings in the projection, doesn't write to DB)
- Combined graph structure with content features
- `iterationWeights` controls how much to weight different "hops" in the graph

**Why MUTATE instead of WRITE?**
- Faster (no database write)
- Useful for intermediate results
- Can still stream/write later if needed


## Part 3: Computing Node Similarity

Now use the embeddings to find similar papers.


In [None]:
# Run K-Nearest Neighbors (KNN) to find the top-5 most similar papers for each paper
# KNN compares node properties such as embeddings; Node Similarity compares neighbor sets
# and does not accept nodeProperties
# Use STREAM mode to get results directly without writing

similarity_stream = gds.knn.stream(
    G,
    nodeProperties=['embedding'],
    topK=5,  # Top 5 most similar papers per paper
    similarityCutoff=0.5  # Only pairs with similarity > 0.5
)

# Convert to DataFrame
df_similarity = pd.DataFrame(similarity_stream)

print(f"Computed {len(df_similarity):,} similarity pairs")
print(f" Average similarity: {df_similarity['similarity'].mean():.3f}")
print(f" Max similarity: {df_similarity['similarity'].max():.3f}")
print("\nSample similarity pairs:")
display(df_similarity.head(10))


**Understanding the results:**
- `node1`, `node2`: Neo4j internal IDs
- `similarity`: Cosine similarity (0-1, higher = more similar)
- Each paper gets up to 5 recommendations (fewer when candidates fall below the cutoff)
- Only pairs above 0.5 similarity are returned

**This is the foundation of recommendation systems!**
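
The `topK` and `similarityCutoff` settings are easier to see on a tiny example. The sketch below is a brute-force NumPy version for toy inputs only; it is not how GDS computes neighbors, and the helper name `top_k_cosine` is made up. It uses raw cosine similarity, which can be negative for arbitrary vectors, and does not rescale it.

```python
import numpy as np
import pandas as pd

def top_k_cosine(embeddings, top_k=5, cutoff=0.5):
    """Brute-force top-K cosine neighbors; suitable for small toy inputs only."""
    X = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / np.maximum(norms, 1e-12)
    sims = unit @ unit.T
    np.fill_diagonal(sims, -np.inf)  # never pair a node with itself
    rows = []
    for i in range(sims.shape[0]):
        # Indices of the top_k most similar nodes for node i, best first
        best = np.argsort(-sims[i])[:top_k]
        for j in best:
            if sims[i, j] > cutoff:
                rows.append({"node1": i, "node2": int(j), "similarity": float(sims[i, j])})
    return pd.DataFrame(rows, columns=["node1", "node2", "similarity"])

# Hypothetical 2-dimensional "embeddings": two pairs of near-parallel vectors
toy_vectors = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.1, 0.9],
])
pairs = top_k_cosine(toy_vectors, top_k=1, cutoff=0.5)
print(pairs)
```

Each row is directional: node 0 listing node 1 as its nearest neighbor is a separate row from node 1 listing node 0, which is the same shape as the `node1`/`node2` output above.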


## Part 4: Paper Recommendations

Let's make this concrete by recommending papers for a specific paper.


In [None]:
# Pick a high-PageRank paper and find similar papers
q_recommendations = """
MATCH (source:Paper)
WHERE source.pageRank > 10
WITH source
ORDER BY source.pageRank DESC
LIMIT 1

// Get source paper details
WITH source, 
 id(source) AS sourceId,
 source.paper_Id AS source_paperId,
 source.subject AS source_subject,
 source.pageRank AS source_pageRank

// Find papers similar to this source (using pre-computed similarity)
MATCH (source)-[:SIMILAR]-(similar:Paper)
WHERE similar.paper_Id <> source.paper_Id

RETURN 
 source_paperId,
 source_subject,
 source_pageRank,
 collect({
 paperId: similar.paper_Id,
 subject: similar.subject,
 pageRank: similar.pageRank
 })[..5] AS recommendations
"""

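# Note: The query above assumes SIMILAR relationships were written to the database
# For this demo, we use the streamed similarity results instead

# Get a specific paper, including its internal node ID
sample_paper_id = 100

q_recommend = f"""
MATCH (p:Paper {{paper_Id: {sample_paper_id}}})
RETURN 
 id(p) AS nodeId,
 p.paper_Id AS paperId,
 p.subject AS subject,
 p.pageRank AS pageRank,
 p.betweenness AS betweenness,
 p.louvainCommunity AS community
"""

source_paper = gds.run_cypher(q_recommend)
print("Source Paper:")
display(source_paper)

# Find similar papers from our similarity stream
# (In production, you'd write these as relationships)
print(f"\nTop 5 Papers Most Similar to Paper {sample_paper_id}:")
print("(Based on combined structure + content + centrality)")
if source_paper.empty:
    print(f"No paper found with paper_Id {sample_paper_id}")
else:
    # node1/node2 in the stream are internal node IDs, so filter on nodeId
    source_node_id = source_paper.loc[0, 'nodeId']
    top_similar = (
        df_similarity[df_similarity['node1'] == source_node_id]
        .sort_values('similarity', ascending=False)
        .head(5)
    )
    display(top_similar)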


**Interpretation:**

Similar papers might:
1. **Same subject** -> Related research
2. **Different subject** -> Interdisciplinary connections
3. **Similar centrality** -> Papers with similar structural importance

This is more sophisticated than just "papers in the same category"!
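
One way to check the same-subject versus different-subject split is to label each recommendation. The sketch below uses small made-up DataFrames in place of `df_similarity` and a paper lookup table; the helper `recommend` and the column name `nodeId` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for df_similarity and a Paper lookup table
toy_similarity = pd.DataFrame({
    "node1": [0, 0, 0, 1],
    "node2": [1, 2, 3, 0],
    "similarity": [0.91, 0.84, 0.77, 0.91],
})
toy_papers = pd.DataFrame({
    "nodeId": [0, 1, 2, 3],
    "subject": ["Neural_Networks", "Neural_Networks", "Theory", "Neural_Networks"],
})

def recommend(similarity, papers, source_node_id, top_n=5):
    """Top-N neighbors of one node, labelled by whether the subject matches."""
    subject_by_node = papers.set_index("nodeId")["subject"]
    recs = (
        similarity[similarity["node1"] == source_node_id]
        .sort_values("similarity", ascending=False)
        .head(top_n)
        .copy()
    )
    recs["subject"] = recs["node2"].map(subject_by_node)
    recs["same_subject"] = recs["subject"] == subject_by_node[source_node_id]
    return recs.reset_index(drop=True)

recs = recommend(toy_similarity, toy_papers, source_node_id=0)
print(recs)
print(f"Same-subject share: {recs['same_subject'].mean():.0%}")
```

Rows where `same_subject` is `False` are the candidates for interdisciplinary connections.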


## Part 5: Clustering with K-Means

Use the embeddings to cluster papers into research topics.


In [None]:
# Run K-Means clustering on the embeddings
# We know there are 7 official subjects, so let's try k=7

num_clusters = 7

kmeans_result = gds.kmeans.write(
    G,
    nodeProperty='embedding',
    k=num_clusters,  # 7 clusters (same as number of subjects)
    writeProperty='cluster',
    computeSilhouette=True,  # Required for averageSilhouette to be computed
    randomSeed=42
)

# communityDistribution summarizes cluster sizes, not cluster IDs
sizes = kmeans_result['communityDistribution']
print(f"Requested {num_clusters} clusters")
print(f" Average silhouette: {kmeans_result['averageSilhouette']:.3f}")
print(f" Cluster sizes: min={sizes['min']}, max={sizes['max']}, mean={sizes['mean']:.1f}")


In [None]:
# Compare K-Means clusters with official subjects
q_cluster_analysis = """
MATCH (p:Paper)
WHERE p.cluster IS NOT NULL
WITH p.cluster AS cluster,
 p.subject AS subject,
 count(*) AS count
RETURN cluster, subject, count
ORDER BY cluster, count DESC
"""

df_clusters = gds.run_cypher(q_cluster_analysis)
print("\nK-Means Cluster Composition:")
display(df_clusters.head(25))


**Key insight:**

K-Means clusters papers based on:
- **Graph structure** (citation patterns)
- **Content similarity** (word features) 
- **Centrality** (PageRank + Betweenness)

Compare with **Louvain** from Lesson 3:
- **Louvain:** Pure graph structure (modularity)
- **K-Means on embeddings:** Structure + content + centrality

Different algorithms -> different perspectives on the same data!
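
A simple way to quantify how well the clusters line up with the official subjects is cluster purity: the share of each cluster taken by its most common subject. The sketch below assumes a table shaped like `df_clusters` (columns `cluster`, `subject`, `count`); the numbers are made up and the helper name `cluster_purity` is hypothetical.

```python
import pandas as pd

# Hypothetical toy rows in the same shape as df_clusters
toy_clusters = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 1],
    "subject": ["Theory", "Neural_Networks", "Neural_Networks", "Theory", "Case_Based"],
    "count": [40, 10, 30, 15, 5],
})

def cluster_purity(df):
    """Per-cluster share of the dominant subject, plus size-weighted overall purity."""
    totals = df.groupby("cluster")["count"].sum()
    dominant = df.groupby("cluster")["count"].max()
    per_cluster = (dominant / totals).rename("purity")
    overall = dominant.sum() / totals.sum()
    return per_cluster, overall

per_cluster, overall = cluster_purity(toy_clusters)
print(per_cluster)
print(f"Overall purity: {overall:.2f}")
```

A purity near 1.0 means a cluster is dominated by one subject. Lower values suggest the embedding is grouping papers by something other than the subject label.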


## Part 6: Production Patterns

Best practices for using GDS in real applications.


### Pattern 1: Estimate Memory Before Running

Always estimate memory for large graphs!


In [None]:
# Estimate memory for PageRank on a large graph
estimate = gds.pageRank.write.estimate(
 G,
 writeProperty='pageRank_test'
)

print("PageRank Memory Estimate:")
print(f" Min bytes: {estimate['bytesMin']:,}")
print(f" Max bytes: {estimate['bytesMax']:,}")
print(f" Node count: {estimate['nodeCount']:,}")
print(f" Relationship count: {estimate['relationshipCount']:,}")
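
Raw byte counts are hard to read at a glance. The helpers below are a small sketch for turning an estimate into a go/no-go decision; `format_bytes`, `fits_in_memory`, and the 80% headroom default are assumptions for illustration, not part of the GDS client.

```python
def format_bytes(num_bytes):
    """Render a byte count with binary units (KiB, MiB, ...)."""
    value = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if value < 1024 or unit == "TiB":
            return f"{value:.1f} {unit}"
        value /= 1024

def fits_in_memory(bytes_max, available_bytes, headroom=0.8):
    """True if the worst-case estimate stays within a fraction of available memory."""
    return bytes_max <= available_bytes * headroom

# Made-up numbers: a 3 GiB worst-case estimate against 8 GiB of available memory
worst_case = 3 * 1024**3
available = 8 * 1024**3
print(format_bytes(worst_case), fits_in_memory(worst_case, available))
```

In a scheduled job, a check like this could run before the algorithm call and skip or alert when the worst case does not fit.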


### Pattern 2: Use STREAM for Exploration, WRITE for Production

**STREAM mode:**
- Returns results as pandas DataFrame
- No database writes
- Good for exploration and testing

**MUTATE mode:**
- Stores results in projection (not database)
- Good for intermediate steps in pipeline

**WRITE mode:**
- Stores results as node/relationship properties
- Persists for future queries
- Good for production


### Pattern 3: List and Clean Up Projections

Always clean up projections when done!


In [None]:
# List all graph projections
projections = gds.graph.list()
print("Current graph projections:")
display(projections[['graphName', 'nodeCount', 'relationshipCount', 'memoryUsage']])


In [None]:
# To drop a projection when you're done:
# (Uncomment to run)

# gds.graph.drop(G)
# print("Dropped projection")
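
To make cleanup hard to forget, the drop can be tied to a context manager so it runs even when an algorithm fails. The sketch below uses stand-in functions so it runs without a database; `temporary_projection` is a hypothetical helper. With the real client, `project` would return the graph object from `gds.graph.project(...)` and `drop` would be `gds.graph.drop`.

```python
from contextlib import contextmanager

@contextmanager
def temporary_projection(project, drop):
    """Yield a projected graph and always drop it afterwards, even on errors."""
    graph = project()
    try:
        yield graph
    finally:
        drop(graph)

# Stand-ins so the sketch runs without a database connection
dropped = []

def fake_project():
    return "toy-graph"

def fake_drop(graph):
    dropped.append(graph)

try:
    with temporary_projection(fake_project, fake_drop) as g:
        raise RuntimeError("algorithm failed mid-pipeline")
except RuntimeError:
    pass

print(dropped)  # the projection was dropped despite the error
```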


### Pattern 4: Batch Processing Pipeline

Example production workflow:


In [None]:
def citation_analysis_pipeline(gds_client, graph_name="citation-pipeline"):
    """
    Complete citation network analysis pipeline.

    Returns a DataFrame of (node1, node2, similarity) pairs.
    """
    print("Step 1: Project graph")
    G, _ = gds_client.graph.project(
        graph_name,
        {"Paper": {"properties": ["features", "subjectClass"]}},
        {"CITES": {"orientation": "UNDIRECTED"}}
    )

    try:
        print("Step 2: Compute centrality metrics")
        gds_client.pageRank.mutate(G, mutateProperty='pr')
        gds_client.betweenness.mutate(G, mutateProperty='bc')

        print("Step 3: Detect communities")
        gds_client.louvain.mutate(G, mutateProperty='community')

        print("Step 4: Create embeddings")
        gds_client.fastRP.mutate(
            G,
            mutateProperty='emb',
            embeddingDimension=128,
            featureProperties=['features'],
            propertyRatio=0.5  # Default 0.0 ignores featureProperties; 0.5 is an assumed starting point
        )

        print("Step 5: Compute similarity")
        # KNN compares node properties; Node Similarity compares neighbor sets
        similarity = gds_client.knn.stream(
            G,
            nodeProperties=['emb'],
            topK=10
        )
    finally:
        # Drop the projection even if a step above fails
        print("Step 6: Clean up")
        gds_client.graph.drop(G)

    return pd.DataFrame(similarity)

# This pattern is perfect for scheduled batch jobs!
print("\nProduction Pipeline Template Created")


## Summary: What You Accomplished

You've completed the entire Python GDS workflow!

- Created node embeddings with FastRP (combining structure + content)
- Computed node similarity for recommendations
- Clustered papers with K-Means
- Learned production patterns (estimate, stream/mutate/write, cleanup)
- Built a complete analysis pipeline

**Module 5 Complete!**

### What You've Learned Across All Lessons

**Lesson 1:** Setup, data loading, projections, PageRank 
**Lesson 2:** Betweenness Centrality, bridge analysis 
**Lesson 3:** Louvain communities, feature engineering 
**Lesson 4:** FastRP embeddings, similarity, K-Means, production patterns

### Key Takeaways

1. **Python GDS client** simplifies graph data science workflows
2. **Combine multiple algorithms** for richer insights
3. **Structure + Content + Centrality** -> powerful embeddings
4. **Production patterns** ensure scalable, maintainable code

### Next Steps

- Apply these techniques to your own datasets
- Explore other GDS algorithms (see GDS documentation)
- Build recommendation systems, fraud detection, knowledge graphs
- Consider **Aura Graph Analytics** for on-demand scalable GDS
