# Milestone 6: Graph-Enhanced RAG

This notebook demonstrates graph-enhanced retrieval-augmented generation (RAG) in four stages:
1. Question rephrasing for better retrieval
2. Anchor document retrieval
3. Neighboring chunk expansion
4. Answer composition with citations

## Setup

In [None]:
# Import required modules
import sys

# Add the parent directory to the import path so that `src` can be imported
sys.path.append('..')

from src import graph_rag

print("Graph RAG module loaded successfully!")

## Step 1: Question Rephrasing

Generate several phrasings of the same question so that retrieval is less sensitive to the exact wording.
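
As a rough illustration of what a non-LLM fallback might look like, the sketch below wraps the question in a few fixed templates. `rephrase_with_templates` and its templates are hypothetical and are not taken from `graph_rag`; the real `rephrase_question` may work quite differently.

```python
from __future__ import annotations


def rephrase_with_templates(question: str, count: int = 3) -> list[str]:
    """Hypothetical template-based rephraser (not the graph_rag implementation)."""
    base = question.strip()
    core = base.rstrip("?").strip()
    if not core:
        return []
    # Lower-case the first letter so the question can follow a lead-in phrase.
    lowered = core[0].lower() + core[1:]
    candidates = [
        base,  # always keep the original wording
        f"Could you explain: {lowered}?",
        f"In simple terms, {lowered}?",
        f"{core}, and how is it defined?",
    ]
    # Remove duplicates while preserving order, then cap at `count`.
    return list(dict.fromkeys(candidates))[:count]
```

Keeping the original question as the first variation means retrieval never does worse than the unrephrased query.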

In [None]:
# Test question rephrasing
original_question = "What is personal data?"

print(f"Original: {original_question}")
print("\nRephrased versions:")
print("=" * 60)

rephrasings = graph_rag.rephrase_question(original_question)

for i, rephrasing in enumerate(rephrasings, 1):
    print(f"{i}. {rephrasing}")

print("\nNote: with an LLM API key configured, these would be LLM-generated semantic variations")

## Step 2: Anchor Retrieval

Retrieve anchor documents using all query variations.
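
One plausible way to combine results from several query variations is to keep the best score seen for each document and then rank the merged set. The sketch below assumes a `search` callable that returns dictionaries with `id` and `score` keys; `merge_anchor_hits` is a hypothetical name, and this is not necessarily how `anchor_retrieve` merges its results.

```python
from typing import Callable, Dict, List


def merge_anchor_hits(
    queries: List[str],
    search: Callable[[str], List[dict]],
    top_k: int = 3,
) -> List[dict]:
    """Hypothetical merge step: keep each document's best score across all queries."""
    best: Dict[str, dict] = {}
    for query in queries:
        for hit in search(query):
            current = best.get(hit["id"])
            # Keep the hit only if the document is new or scores higher than before.
            if current is None or hit["score"] > current["score"]:
                best[hit["id"]] = hit
    ranked = sorted(best.values(), key=lambda doc: doc["score"], reverse=True)
    return ranked[:top_k]
```

Taking the maximum score (rather than the sum) avoids favoring documents simply because many near-identical rephrasings matched them; other fusion rules are equally reasonable.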

In [None]:
# Test anchor retrieval
queries = [
    "What is personal data?",
    "Can you explain personal data?",
    "Define personal data according to GDPR"
]

print("Retrieving anchor documents...\n")
anchors = graph_rag.anchor_retrieve(queries, top_k=3)

print("Anchor Documents:")
print("=" * 60)

for i, anchor in enumerate(anchors, 1):
    print(f"\n{i}. {anchor['id']}")
    print(f"   Content: {anchor['content']}")
    print(f"   Score: {anchor['score']}")
    print(f"   Neighbors: {anchor['neighbors']}")

## Step 3: Neighbor Expansion

Expand retrieval to include neighboring chunks.
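
Neighbor expansion can be pictured as a breadth-first walk from the anchors, limited to a fixed number of hops. The sketch below assumes the chunk graph is available as a plain adjacency dictionary mapping a chunk ID to its neighbor IDs; `expand_neighbors` is a hypothetical helper and may differ from what `neighbor_retrieve` does internally.

```python
from collections import deque
from typing import Dict, List


def expand_neighbors(
    anchor_ids: List[str],
    graph: Dict[str, List[str]],
    depth: int = 1,
) -> List[str]:
    """Hypothetical breadth-first expansion up to `depth` hops from the anchors."""
    order = list(dict.fromkeys(anchor_ids))  # anchors first, duplicates removed
    seen = set(order)
    frontier = deque((doc_id, 0) for doc_id in order)
    while frontier:
        doc_id, distance = frontier.popleft()
        if distance == depth:
            continue  # do not expand beyond the requested depth
        for neighbor in graph.get(doc_id, []):
            if neighbor not in seen:
                seen.add(neighbor)
                order.append(neighbor)
                frontier.append((neighbor, distance + 1))
    return order
```

With `depth=1` this returns the anchors followed by their direct neighbors, which matches the `expand_depth=1` setting used in this notebook.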

In [None]:
# Test neighbor expansion
print(f"Starting with {len(anchors)} anchor documents")
print("\nExpanding to neighbors...\n")

expanded = graph_rag.neighbor_retrieve(anchors, expand_depth=1)

print(f"Expanded to {len(expanded)} total documents")
print("\nExpanded Document Set:")
print("=" * 60)

for i, doc in enumerate(expanded[:6], 1):  # Show first 6
    doc_type = doc['metadata'].get('type', 'anchor')
    print(f"\n{i}. {doc['id']} ({doc_type})")
    print(f"   Content: {doc['content'][:80]}...")
    print(f"   Score: {doc['score']:.2f}")

## Step 4: Answer Composition with Citations

Generate an answer with explicit source citations.
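
A common way to make citations explicit is to number each document in the context and keep a lookup from marker to source. The sketch below is a minimal, hypothetical helper (`build_cited_context` is not part of `graph_rag`), and it assumes each document has `id` and `content` keys with an optional `article` entry under `metadata`.

```python
from typing import Dict, List


def build_cited_context(docs: List[dict], max_docs: int = 3) -> Dict[str, object]:
    """Hypothetical helper: number each document and record its source for citation."""
    context_lines = []
    citations = []
    for marker, doc in enumerate(docs[:max_docs], start=1):
        context_lines.append(f"[{marker}] {doc['content']}")
        citations.append({
            "marker": marker,
            "doc_id": doc["id"],
            # `article` is an assumed metadata field; fall back if it is absent.
            "article": doc.get("metadata", {}).get("article", "unknown"),
        })
    return {"context": "\n".join(context_lines), "citations": citations}
```

The numbered context can be passed to a generator, and any `[n]` marker in the answer can then be resolved back to a document through the `citations` list.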

In [None]:
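# Test answer composition
# Note: `expanded` was retrieved for the personal-data queries in Step 2,
# so the retrieved context may only partly cover this question.
question = "What are the principles of data processing?"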

print(f"Question: {question}")
print("\nComposing answer with citations...\n")

result = graph_rag.compose_answer_with_citations(
    question=question,
    docs=expanded
)

print("Answer with Citations:")
print("=" * 60)
print(result['answer'])
print("\n" + "=" * 60)

print("\nCitations:")
for citation in result['citations']:
    print(f"  - {citation['doc_id']} (Article {citation['article']}) - {citation['relevance']} relevance")

print("\nMetadata:")
print(f"  Sources: {result['num_sources']} anchors")
print(f"  Expanded: {result['num_expanded']} neighbors")

## Step 5: Complete Graph RAG Pipeline

Create a `GraphRAG` instance that bundles the steps above into one configurable pipeline.

In [None]:
# Create Graph RAG instance
graph_rag_system = graph_rag.GraphRAG(
    rephrase_count=3,
    anchor_top_k=3,
    expand_depth=1
)

print("Graph RAG system initialized")
print(f"  Rephrase count: {graph_rag_system.rephrase_count}")
print(f"  Anchor top K: {graph_rag_system.anchor_top_k}")
print(f"  Expand depth: {graph_rag_system.expand_depth}")

## Step 6: Run Graph RAG Query

Run the complete graph-enhanced pipeline on a single question.
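
Conceptually, a full query chains the four steps from earlier in this notebook. The sketch below wires them together through injected callables so it can run without the project code; `run_pipeline` is hypothetical, and the way it derives `num_sources` and `num_expanded` is an assumption rather than a description of `GraphRAG.query`.

```python
from typing import Callable, List


def run_pipeline(
    question: str,
    rephrase: Callable[[str], List[str]],
    retrieve_anchors: Callable[[List[str]], List[dict]],
    expand: Callable[[List[dict]], List[dict]],
    compose: Callable[[str, List[dict]], dict],
) -> dict:
    """Hypothetical orchestration: rephrase -> anchors -> neighbors -> cited answer."""
    queries = rephrase(question)
    anchors = retrieve_anchors(queries)
    expanded = expand(anchors)
    result = dict(compose(question, expanded))
    # Assumed bookkeeping: anchors count as sources, the remainder as expanded neighbors.
    result["num_sources"] = len(anchors)
    result["num_expanded"] = len(expanded) - len(anchors)
    return result
```

Passing the stages in as arguments keeps each one replaceable, for example swapping an LLM rephraser for a template-based one during testing.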

In [None]:
# Run complete pipeline
question = "What rights do individuals have under GDPR?"

print(f"Question: {question}")
print("\n" + "=" * 60)

result = graph_rag_system.query(question)

print("\n" + "=" * 60)
print("\nGRAPH RAG RESULT:")
print(f"Question: {result['question']}")
print(f"\nAnswer:\n{result['answer']}")
print("\nCitations:")
for citation in result['citations']:
    print(f"  [{citation['doc_id']}] Article {citation['article']}")
print("\nRetrieval Stats:")
print(f"  Anchor docs: {result['num_sources']}")
print(f"  Expanded docs: {result['num_expanded']}")

## Step 7: Compare with Baseline RAG

Compare Graph RAG with a simple baseline RAG on the same query.
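
To make the comparison a little more concrete than reading two printed answers, the counts in each result can be reduced to a small summary. The sketch below relies only on the result keys used in this notebook (`num_sources`, `num_expanded`, `citations`); `summarize_comparison` itself is a hypothetical helper, and document counts say nothing about answer quality.

```python
def summarize_comparison(baseline_result: dict, graph_result: dict) -> dict:
    """Hypothetical helper: compare how much context each system retrieved."""
    baseline_docs = baseline_result["num_sources"]
    graph_docs = graph_result["num_sources"] + graph_result["num_expanded"]
    return {
        "baseline_docs": baseline_docs,
        "graph_docs": graph_docs,
        # Positive when Graph RAG brought in more context than the baseline.
        "extra_docs": graph_docs - baseline_docs,
        "graph_citations": len(graph_result.get("citations", [])),
    }
```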

In [None]:
# Comparison
from src import rag_baseline

baseline_rag = rag_baseline.BaselineRAG(top_k=3)
query = "What is data minimization?"

print("Comparison: Baseline RAG vs. Graph RAG")
print("=" * 60)

print("\n1. BASELINE RAG:")
baseline_result = baseline_rag.query(query)
print(f"   Answer: {baseline_result['answer'][:200]}...")
print(f"   Sources: {baseline_result['num_sources']} documents")

print("\n2. GRAPH RAG:")
graph_result = graph_rag_system.query(query)
print(f"   Answer: {graph_result['answer'][:200]}...")
print(f"   Sources: {graph_result['num_sources']} anchors + {graph_result['num_expanded']} neighbors")
print(f"   Citations: {len(graph_result['citations'])} explicit")

print("\nExpected advantages of Graph RAG (not measured in this notebook):")
print("  ✓ Multiple query variations for better recall")
print("  ✓ Neighbor expansion for broader context")
print("  ✓ Explicit citation tracking")
print("  ✓ Document relationship awareness")

## Step 8: Test with Complex Queries

Test Graph RAG on complex, multi-faceted questions.

In [None]:
# Complex test queries
complex_queries = [
    "How does GDPR define consent and what are the requirements?",
    "What obligations do data controllers have regarding data breaches?",
    "Explain the relationship between data protection principles and individual rights"
]

print("Testing Graph RAG on complex queries:")
print("=" * 60)

for i, query in enumerate(complex_queries, 1):
    print(f"\n{i}. {query}")
    result = graph_rag_system.query(query)
    print(f"   Answer: {result['answer'][:150]}...")
    print(f"   Citations: {len(result['citations'])}")
    print(f"   Total docs: {result['num_sources'] + result['num_expanded']}")

## Summary

In this notebook, we:
- ✅ Demonstrated question rephrasing
- ✅ Retrieved anchor documents across query variations
- ✅ Expanded retrieval to neighboring chunks
- ✅ Composed answers with explicit citations
- ✅ Ran the complete Graph RAG pipeline
- ✅ Compared it with baseline RAG
- ✅ Tested it on complex queries

Next: Proceed to `07_responsible_ai_and_tests.ipynb` for testing and evaluation.