# 🎓 Workshop: Graph RAG with Neo4j and LangChain

## Overview
In this notebook, you'll learn how to build a **Graph RAG (Retrieval-Augmented Generation)** system using LangChain that:
1. Extracts entities and relationships from documents using LLMs
2. Stores them in a Neo4j knowledge graph
3. Uses LangChain's `GraphCypherQAChain` to translate natural-language questions into Cypher
4. Generates answers from the graph data those queries return

## What is Graph RAG?
Graph RAG combines:
- **Knowledge Graphs**: Structured representation of entities and their relationships
- **LLMs**: For extraction, retrieval, and generation
- **Neo4j**: Graph database for storing and querying connected data
- **LangChain**: Framework that simplifies Graph RAG implementation
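
To make the knowledge-graph idea concrete, here is a minimal pure-Python sketch that stores facts as (source, relationship, target) triples and turns the facts around one entity into context lines for a prompt. The triples and the `context_for` helper are illustrative assumptions, not data from the workshop document; in the rest of the notebook Neo4j plays the role of this list.

```python
# Toy knowledge graph: each fact is a (source, relationship, target) triple.
# These triples are made up for illustration only.
TRIPLES = [
    ("Alice", "WORKS_AT", "Example University"),
    ("Alice", "RESEARCHES", "Graph Databases"),
    ("Example University", "PARTNERS_WITH", "Sample Institute"),
]


def context_for(entity, triples):
    """Return one text line per triple that mentions `entity`."""
    return [
        f"{source} -[{relation}]-> {target}"
        for source, relation, target in triples
        if entity in (source, target)
    ]


# Lines like these are what a Graph RAG system hands to the LLM as context.
for line in context_for("Example University", TRIPLES):
    print(line)
```

Retrieval here is a simple filter over the triples; a graph database does the same job with indexes and a query language.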

## Architecture
```
Document → Entity Extraction → Relationship Extraction → Neo4j Graph
                                                              ↓
User Query → LangChain GraphCypherQAChain → Cypher Query → Answer
```

---

## 📦 Step 1: Install Dependencies

Install LangChain and required packages

In [None]:
# Install dependencies (specifiers are quoted so the shell does not treat ">" as a redirect)
!pip install -q "neo4j>=5.15.0" "langchain>=0.1.0" "langchain-community>=0.0.10" \
    "langchain-groq>=1.0.0" "groq>=0.4.0" "python-dotenv>=1.0.0" "pydantic>=2.5.0"
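
After installing, it can help to confirm which versions actually ended up in the environment. The sketch below uses only the standard library; the `installed_versions` helper is a hypothetical name, and the package list simply mirrors the pip command in this step.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names taken from the pip command in this step
PACKAGES = [
    "neo4j", "langchain", "langchain-community", "langchain-groq",
    "groq", "python-dotenv", "pydantic",
]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(PACKAGES).items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```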

## 🔧 Step 2: Set Up Environment Variables

Load credentials from a `.env` file

In [None]:
import os
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

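# Verify credentials without printing secrets (and without failing if a variable is unset)
neo4j_uri = os.getenv('NEO4J_URI')
print("✅ Environment variables loaded:")
print(f"  - Neo4j URI: {neo4j_uri[:30] + '...' if neo4j_uri else '✗ Missing'}")
print(f"  - Groq API Key: {'✓ Set' if os.getenv('GROQ_API_KEY') else '✗ Missing'}")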
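
The later cells also read `NEO4J_USERNAME` and `NEO4J_PASSWORD`, so a single check for every variable can catch a missing one early. This is a minimal sketch; `missing_env_vars` is a hypothetical helper, and the list of names is taken from the cells in this notebook.

```python
import os

# Variable names read by the later cells in this notebook
REQUIRED_VARS = ["NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD", "GROQ_API_KEY"]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All required environment variables are set")
```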

## 📚 Step 3: Load Sample Data

Load the university research network document

In [None]:
# Read the sample document
with open('data/samples/university_research_network.md', 'r', encoding='utf-8') as f:
    document_text = f.read()

print(f"📄 Document loaded: {len(document_text)} characters")
print(f"\nFirst 500 characters:\n{document_text[:500]}...")

## 🤖 Step 4: Initialize LLM

Set up the Groq LLM for entity extraction and Q&A

In [None]:
from langchain_groq import ChatGroq

# Initialize Groq LLM with Moonshot AI Kimi model
llm = ChatGroq(
    groq_api_key=os.getenv('GROQ_API_KEY'),
    model_name="moonshotai/kimi-k2-instruct-0905",
    temperature=0
)

print("✅ LLM initialized (Moonshot AI Kimi K2)")

## 🔗 Step 5: Connect to Neo4j using LangChain

Use LangChain's Neo4jGraph class for easy graph operations

In [None]:
from langchain_community.graphs import Neo4jGraph

# Initialize Neo4j graph connection
graph = Neo4jGraph(
    url=os.getenv('NEO4J_URI'),
    username=os.getenv('NEO4J_USERNAME'),
    password=os.getenv('NEO4J_PASSWORD')
)

print("✅ Connected to Neo4j")
print(f"\n📊 Graph Schema:\n{graph.schema}")

## 🧹 Step 5.5: Clean Up Existing Data (Optional)

Delete all existing nodes and relationships to start fresh. The delete cannot be undone, so run this cell only against a database you can afford to wipe.

In [None]:
# Delete all nodes and relationships
print("🧹 Cleaning up existing data...")

# Count existing data
count_result = graph.query("MATCH (n) RETURN count(n) as node_count")
node_count = count_result[0]['node_count'] if count_result else 0
print(f"   Found {node_count} existing nodes")

# Delete all
graph.query("MATCH (n) DETACH DELETE n")

# Verify
verify_result = graph.query("MATCH (n) RETURN count(n) as node_count")
remaining = verify_result[0]['node_count'] if verify_result else 0

print(f"✅ Database cleaned! Remaining nodes: {remaining}")

## 🧠 Step 6: Extract Entities from Document

Use LLM to identify key entities (universities, people, research areas)

In [None]:
import json
from groq import Groq

# Initialize Groq client for structured extraction
groq_client = Groq(api_key=os.getenv('GROQ_API_KEY'))

# Entity extraction prompt
extraction_prompt = f"""Extract entities from this text. Return a JSON array of entities.
Each entity should have: name, type (UNIVERSITY, PERSON, RESEARCH_AREA, ORGANIZATION), description.

Text:
{document_text}

Return ONLY valid JSON array, no other text."""

response = groq_client.chat.completions.create(
    model="moonshotai/kimi-k2-instruct-0905",
    messages=[{"role": "user", "content": extraction_prompt}],
    temperature=0
)

# Parse entities
entities_text = response.choices[0].message.content
entities = json.loads(entities_text)

print(f"✅ Extracted {len(entities)} entities")
for e in entities[:5]:
    print(f"  - {e['name']} ({e['type']})")
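
Even when asked for JSON only, models sometimes wrap the array in a Markdown code fence or add a sentence around it, which makes a bare `json.loads` fail. The sketch below is one tolerant way to parse such output; `parse_json_array` is a hypothetical helper, not part of the Groq or LangChain APIs.

```python
import json
import re

# Markdown code-fence marker (three backticks), built programmatically
FENCE = "`" * 3


def parse_json_array(text):
    """Parse a JSON array from LLM output, tolerating code fences and stray prose."""
    cleaned = text.strip()
    # Drop an opening fence (optionally tagged, e.g. "json") and a closing fence
    cleaned = re.sub(r"^" + FENCE + r"[A-Za-z]*\s*", "", cleaned)
    cleaned = re.sub(r"\s*" + FENCE + r"$", "", cleaned)
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError:
        # Fall back to the outermost [...] span in the text
        start, end = cleaned.find("["), cleaned.rfind("]")
        if start == -1 or end <= start:
            raise ValueError("No JSON array found in model output")
        data = json.loads(cleaned[start:end + 1])
    if not isinstance(data, list):
        raise ValueError("Expected a JSON array at the top level")
    return data
```

With this in place, `entities = parse_json_array(entities_text)` could stand in for the `json.loads` call; the same helper would apply to any other cell that parses a JSON array from model output.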

## 🔗 Step 7: Extract Relationships

Identify connections between entities

In [None]:
# Relationship extraction prompt
relationship_prompt = f"""Extract relationships between entities from this text.
Return a JSON array of relationships.
Each relationship should have: source (entity name), target (entity name), type (relationship type).

Text:
{document_text}

Return ONLY valid JSON array, no other text."""

response = groq_client.chat.completions.create(
    model="moonshotai/kimi-k2-instruct-0905",
    messages=[{"role": "user", "content": relationship_prompt}],
    temperature=0
)

# Parse relationships
relationships_text = response.choices[0].message.content
relationships = json.loads(relationships_text)

print(f"✅ Extracted {len(relationships)} relationships")
for r in relationships[:5]:
    print(f"  - {r['source']} --[{r['type']}]--> {r['target']}")
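
Entities and relationships come from two separate LLM calls, so a relationship may name a source or target that never appeared in the entity list. A quick check before loading the graph makes those mismatches visible. This is a sketch with a hypothetical helper, `split_by_known_entities`, and made-up sample data.

```python
def split_by_known_entities(relationships, entities):
    """Separate relationships whose endpoints both match an extracted entity name."""
    known = {entity["name"] for entity in entities}
    valid, dangling = [], []
    for rel in relationships:
        if rel.get("source") in known and rel.get("target") in known:
            valid.append(rel)
        else:
            dangling.append(rel)
    return valid, dangling


# Made-up sample data in the same shape as the extraction output
sample_entities = [{"name": "Alice", "type": "PERSON"},
                   {"name": "Example University", "type": "UNIVERSITY"}]
sample_relationships = [
    {"source": "Alice", "target": "Example University", "type": "WORKS_AT"},
    {"source": "Alice", "target": "Unknown Lab", "type": "MEMBER_OF"},
]

valid, dangling = split_by_known_entities(sample_relationships, sample_entities)
print(f"{len(valid)} usable, {len(dangling)} with unknown endpoints")
```

Matching is exact and case-sensitive here, which mirrors how the graph-building step looks nodes up by `name`.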

## 🏗️ Step 8: Build Knowledge Graph in Neo4j

Store entities and relationships using LangChain's Neo4jGraph

In [None]:
import hashlib
import re

# Create entities
for entity in entities:
    # Generate a stable ID from name and type so re-runs merge onto the same node
    entity_id = hashlib.md5(f"{entity['name']}_{entity['type']}".encode()).hexdigest()[:16]

    cypher = """
    MERGE (e:Entity {id: $id, name: $name, type: $type})
    SET e.description = $description
    """
    graph.query(cypher, params={
        'id': entity_id,
        'name': entity['name'],
        'type': entity['type'],
        'description': entity.get('description', '')
    })

print(f"✅ Created {len(entities)} entity nodes")

# Create relationships
for rel in relationships:
    # Relationship types cannot be passed as Cypher parameters, so reduce the
    # LLM-provided value to letters, digits, and underscores before interpolating it
    rel_type = re.sub(r'[^A-Z0-9_]', '_', rel['type'].strip().upper())
    cypher = f"""
    MATCH (source:Entity {{name: $source}})
    MATCH (target:Entity {{name: $target}})
    MERGE (source)-[r:`{rel_type}`]->(target)
    """
    try:
        # Note: a MATCH that finds no node creates nothing and raises no error
        graph.query(cypher, params={
            'source': rel['source'],
            'target': rel['target']
        })
    except Exception as e:
        print(f"⚠️  Skipped relationship: {rel['source']} -> {rel['target']} ({e})")

print("✅ Created relationships")
print("\n🎉 Knowledge graph built successfully!")
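
Running one query per relationship means one round trip to the database each time. A common alternative is to group rows by relationship type and send each group in a single `UNWIND` query. The sketch below only builds the query strings and parameters, so it runs without a database; `batch_relationship_queries` is a hypothetical helper, and the `Entity` label and `name` property follow the cell in this step.

```python
import re
from collections import defaultdict


def batch_relationship_queries(relationships):
    """Group relationships by sanitized type and build one UNWIND query per type."""
    rows_by_type = defaultdict(list)
    for rel in relationships:
        # Relationship types cannot be parameters, so keep only A-Z, 0-9 and "_"
        rel_type = re.sub(r"[^A-Z0-9_]", "_", rel["type"].strip().upper())
        rows_by_type[rel_type].append(
            {"source": rel["source"], "target": rel["target"]}
        )

    queries = []
    for rel_type, rows in rows_by_type.items():
        cypher = (
            "UNWIND $rows AS row\n"
            "MATCH (source:Entity {name: row.source})\n"
            "MATCH (target:Entity {name: row.target})\n"
            f"MERGE (source)-[:`{rel_type}`]->(target)"
        )
        queries.append((cypher, {"rows": rows}))
    return queries


# Made-up sample in the same shape as the extracted relationships
sample = [
    {"source": "Alice", "target": "Example University", "type": "works at"},
    {"source": "Bob", "target": "Example University", "type": "Works At"},
]
for cypher, params in batch_relationship_queries(sample):
    print(cypher)
    print(params)
```

Each `(cypher, params)` pair could then be passed to `graph.query(cypher, params=params)`, the same call used elsewhere in this notebook.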

## 🔍 Step 9: Query the Graph with Cypher

Run direct Cypher queries to explore the graph

In [None]:
# Count entities by type
result = graph.query("""
MATCH (e:Entity)
RETURN e.type as type, count(*) as count
ORDER BY count DESC
""")

print("📊 Entity counts by type:")
for row in result:
    print(f"  - {row['type']}: {row['count']}")

# Find Stanford and its connections
result = graph.query("""
MATCH (e:Entity {name: 'Stanford University'})-[r]-(connected)
RETURN type(r) as relationship, connected.name as entity
LIMIT 10
""")

print("\n🔗 Stanford University connections:")
for row in result:
    print(f"  - {row['relationship']} → {row['entity']}")

## 💬 Step 10: Graph RAG with LangChain's GraphCypherQAChain

Use LangChain's built-in chain for natural language queries over the graph

In [None]:
from langchain.chains import GraphCypherQAChain

# Refresh graph schema to help LLM understand structure
graph.refresh_schema()

# Create the Graph RAG chain
chain = GraphCypherQAChain.from_llm(
    llm=llm,
    graph=graph,
    verbose=True,
    return_intermediate_steps=True,  # Expose the generated Cypher in the result
    allow_dangerous_requests=True  # Acknowledges that LLM-generated Cypher runs against the database
)

print("✅ GraphCypherQAChain initialized")
print(f"\n📊 Graph Schema:\n{graph.schema}")
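
Because the chain executes whatever Cypher the model writes, it is worth being able to tell whether a generated query only reads data. The sketch below is a keyword heuristic, not a parser: the clause list is an illustrative assumption and is not exhaustive (procedures invoked with `CALL` can also write), and it can flag harmless queries whose string literals contain one of these words. A database user with read-only permissions is the more robust control.

```python
import re

# Cypher clauses that modify the database (illustrative, not exhaustive)
WRITE_CLAUSES = re.compile(
    r"\b(CREATE|MERGE|DELETE|DETACH|SET|REMOVE|DROP|LOAD\s+CSV|FOREACH)\b",
    re.IGNORECASE,
)


def looks_read_only(cypher):
    """Heuristic check: True when no write clause appears in the query text."""
    return WRITE_CLAUSES.search(cypher) is None


for query in ["MATCH (n) RETURN count(n)", "MATCH (n) DETACH DELETE n"]:
    label = "read-only" if looks_read_only(query) else "WRITES"
    print(f"{label}: {query}")
```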

## 🎯 Step 11: Ask Questions!

Now let's ask questions about the knowledge graph

In [None]:
# Question 1: Count and list collaborations
question = "How many collaborations do we have and list them?"
print(f"❓ Question: {question}\n")

result = chain.invoke({"query": question})

print(f"\n💡 Answer: {result['result']}")
print(f"\n🔍 Generated Cypher: {result['intermediate_steps'][0]['query']}")

In [None]:
# Question 2: Find specific entity
question = "Tell me about Stanford University"
print(f"❓ Question: {question}\n")

result = chain.invoke({"query": question})

print(f"\n💡 Answer: {result['result']}")

In [None]:
# Question 3: Find relationships
question = "Which entities are connected to Stanford University?"
print(f"❓ Question: {question}\n")

result = chain.invoke({"query": question})

print(f"\n💡 Answer: {result['result']}")

In [None]:
# Question 4: List partnerships
question = "List all the Partnerships"
print(f"❓ Question: {question}\n")

result = chain.invoke({"query": question})

print(f"\n💡 Answer: {result['result']}")
print("\n💬 Note: For complex queries like 'Which Stanford faculty research NLP?',")
print("    a stronger model such as GPT-4o or Claude 3.5 may handle multi-hop reasoning better.")
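
The four question cells repeat the same invoke-and-print pattern, which could be folded into one helper that also returns the generated Cypher. The sketch below assumes the result shape used in the cells of this step (`result['result']` and `result['intermediate_steps'][0]['query']`); `ask` is a hypothetical helper and `StubChain` is a stand-in so the example runs without an LLM or database.

```python
def ask(chain, question):
    """Run one question through the chain and return (answer, generated_cypher)."""
    result = chain.invoke({"query": question})
    steps = result.get("intermediate_steps") or []
    # The first intermediate step holds the generated query when the chain
    # returns intermediate steps; otherwise report None
    cypher = steps[0].get("query") if steps and isinstance(steps[0], dict) else None
    return result.get("result"), cypher


class StubChain:
    """Hypothetical stand-in for GraphCypherQAChain so the helper can run offline."""

    def invoke(self, inputs):
        return {
            "result": f"stub answer for: {inputs['query']}",
            "intermediate_steps": [{"query": "MATCH (n) RETURN n LIMIT 1"}],
        }


answer, cypher = ask(StubChain(), "Which entities exist?")
print(answer)
print(cypher)
```

In the notebook, `ask(chain, "Tell me about Stanford University")` would take the place of one question cell.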

## 💡 Step 12: Direct Cypher Queries (When LLM Struggles)

When the LLM-generated Cypher is wrong or returns nothing useful, write the query directly:

In [None]:
# Direct query for collaborations
print("🔍 Direct Query: Universities that collaborate\n")
result = graph.query("""
MATCH (u1:Entity)-[r]-(u2:Entity)
WHERE u1.type = 'UNIVERSITY' AND u2.type = 'UNIVERSITY'
RETURN u1.name as university1, type(r) as relationship, u2.name as university2
LIMIT 10
""")

for row in result:
    print(f"  {row['university1']} --[{row['relationship']}]--> {row['university2']}")
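
An undirected pattern such as `(u1)-[r]-(u2)` matches every relationship once in each orientation, so the collaboration query can list the same pair twice. One way to tidy the output is to drop mirrored rows in Python, as in the sketch below; `unique_pairs` is a hypothetical helper and the rows are made-up examples using the column names from that query. A directed pattern in the Cypher itself would avoid the duplicates at the source.

```python
def unique_pairs(rows):
    """Drop rows that repeat an already-seen pair in the opposite direction."""
    seen = set()
    unique = []
    for row in rows:
        # A frozenset ignores order, so (A, B) and (B, A) produce the same key
        key = (frozenset((row["university1"], row["university2"])), row["relationship"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Made-up rows in the shape returned by the collaboration query
rows = [
    {"university1": "A", "relationship": "PARTNERS_WITH", "university2": "B"},
    {"university1": "B", "relationship": "PARTNERS_WITH", "university2": "A"},
]
for row in unique_pairs(rows):
    print(f"{row['university1']} --[{row['relationship']}]-- {row['university2']}")
```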

In [None]:
# Direct query for research areas
print("🔍 Direct Query: Research areas\n")
result = graph.query("""
MATCH (e:Entity)
WHERE e.type = 'RESEARCH_AREA'
RETURN e.name as research_area, e.description as description
LIMIT 10
""")

for row in result:
    print(f"  • {row['research_area']}: {row['description'][:100] if row['description'] else 'N/A'}...")

In [None]:
# Direct query for faculty at Stanford
print("🔍 Direct Query: Faculty at Stanford\n")
result = graph.query("""
MATCH (stanford:Entity {name: 'Stanford University'})-[r]-(person:Entity)
WHERE person.type = 'PERSON'
RETURN person.name as faculty, type(r) as relationship, person.description as description
LIMIT 10
""")

for row in result:
    print(f"  • {row['faculty']}: {row['description'][:100] if row['description'] else 'N/A'}...")

## 🎯 Key Takeaways

### What We Learned:

1. **LangChain Simplifies Graph RAG**
   - `Neo4jGraph`: Easy connection and query execution
   - `GraphCypherQAChain`: Automatic Cypher generation from natural language
   - Direct Cypher queries when you need precise control

2. **Graph RAG Workflow**
   - Extract entities and relationships using LLMs
   - Store in Neo4j knowledge graph
   - Query using natural language OR direct Cypher
   - LLM generates Cypher and interprets results

3. **Benefits of Graph RAG**
   - Structured knowledge representation
   - Relationship-aware retrieval
   - Multi-hop reasoning
   - Explainable queries (see generated Cypher)

4. **When to Use What**
   - ✅ Use `GraphCypherQAChain` for exploratory questions
   - ✅ Use direct Cypher for precise, production queries
   - ✅ Combine both for best results
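
The multi-hop reasoning listed among the benefits can be illustrated without a database: a question like "how is Alice connected to Bob?" becomes a path search over the stored facts. The sketch below runs a breadth-first search over made-up triples; `find_path` is a hypothetical helper, and relationship direction is ignored for simplicity. In Neo4j the equivalent work is done by a path pattern in Cypher.

```python
from collections import deque


def find_path(triples, start, goal, max_hops=3):
    """Breadth-first search over triples, ignoring direction; returns hops or None."""
    neighbors = {}
    for source, relation, target in triples:
        neighbors.setdefault(source, []).append((relation, target))
        neighbors.setdefault(target, []).append((relation, source))

    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        if len(path) >= max_hops:
            continue  # Do not expand beyond the hop limit
        for relation, nxt in neighbors.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, path + [(node, relation, nxt)]))
    return None


# Made-up facts for illustration only
FACTS = [
    ("Alice", "WORKS_AT", "Example University"),
    ("Example University", "PARTNERS_WITH", "Sample Institute"),
    ("Bob", "WORKS_AT", "Sample Institute"),
]

for hop in find_path(FACTS, "Alice", "Bob") or []:
    print(hop)
```

Each printed hop is one edge on the path, which is the kind of evidence a Graph RAG answer can cite.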

### LangChain Components Used:
- ✅ `Neo4jGraph`: Graph database connection
- ✅ `GraphCypherQAChain`: Natural language to Cypher
- ✅ `ChatGroq`: LLM for generation

---

## 🚀 Next Steps

- Try with your own documents
- Experiment with different entity types
- Learn Cypher for better control
- Explore the **Hybrid RAG notebook** to combine graph + vector search!

---

**Workshop Complete!** 🎉