# Neo4j RAG System - Setup & First Steps

This notebook guides you through:
1. Setting up and verifying your environment
2. Connecting to Neo4j
3. Loading your first document
4. Running vector and hybrid searches
5. Loading a PDF with Docling
6. Visualizing search scores

## Prerequisites
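- Neo4j running in Docker, with Bolt reachable at `bolt://localhost:7687`
- Python environment activated
- Dependencies installed (`neo4j`, `sentence-transformers`, `docling`, `matplotlib`)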
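
Before running the cells below, it can help to confirm that something is listening on the Bolt port. The following is a minimal sketch using only the standard library; the helper name `bolt_port_is_open` is hypothetical, and the default host and port are assumed from the connection settings used later in this notebook. An open port only shows that a TCP listener exists, not that Neo4j accepts your credentials.

```python
import socket


def bolt_port_is_open(host: str = "localhost", port: int = 7687, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and attempts the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures
        return False
```

Calling `bolt_port_is_open()` with no arguments checks `localhost:7687`; a `False` result usually means the Docker container is not running or the port is not published.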

## 1. Environment Setup & Verification

In [None]:
# Import required libraries
import sys
import os
sys.path.append('..')  # Add parent directory to path

# Verify installations
try:
    import neo4j
    print("✅ Neo4j driver:", neo4j.__version__)
except ImportError:
    print("❌ Neo4j driver not installed")

try:
    from sentence_transformers import SentenceTransformer
    print("✅ Sentence Transformers installed")
except ImportError:
    print("❌ Sentence Transformers not installed")

try:
    from docling.document_converter import DocumentConverter
    print("✅ Docling installed")
except ImportError:
    print("❌ Docling not installed")

print("\n📊 Python version:", sys.version)
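
The checks above import each library, which can be slow for heavy packages. As an alternative sketch, `importlib.metadata` can report installed versions without importing anything. The helper name `installed_versions` is hypothetical, and the distribution names below are assumed to match this project's requirements.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Distribution names as published on PyPI (assumed, not taken from a requirements file)
report = installed_versions(["neo4j", "sentence-transformers", "docling", "matplotlib"])
for name, ver in report.items():
    print(f"{name}: {ver if ver is not None else 'not installed'}")
```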

## 2. Connect to Neo4j Database

In [None]:
from neo4j_rag import Neo4jRAG

# Initialize connection
rag = Neo4jRAG(
    uri="bolt://localhost:7687",
    username="neo4j",
    password="password"
)

# Check connection and get stats
stats = rag.get_stats()
print("✅ Connected to Neo4j!")
print("📊 Current database stats:")
print(f"   - Documents: {stats['documents']}")
print(f"   - Chunks: {stats['chunks']}")
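
The cell above hard-codes the credentials, which is fine for a local Docker instance but easy to leak once a notebook is shared. One possible approach, sketched below, reads the settings from environment variables and falls back to the local defaults. The helper `neo4j_settings` and the variable names `NEO4J_URI`, `NEO4J_USERNAME`, and `NEO4J_PASSWORD` are assumptions for illustration, not something this project defines.

```python
import os


def neo4j_settings(env=None):
    """Collect connection settings, falling back to this notebook's local defaults."""
    env = os.environ if env is None else env
    return {
        "uri": env.get("NEO4J_URI", "bolt://localhost:7687"),
        "username": env.get("NEO4J_USERNAME", "neo4j"),
        "password": env.get("NEO4J_PASSWORD", "password"),
    }


settings = neo4j_settings()
print("Connecting to", settings["uri"], "as", settings["username"])
# rag = Neo4jRAG(**settings)  # same keyword arguments as the cell above
```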

## 3. Load Your First Document

In [None]:
# Create a simple test document
test_document = """
Neo4j is a powerful graph database that excels at managing highly connected data.
It uses nodes, relationships, and properties to represent and store data.

Key features of Neo4j include:
1. ACID compliance for reliable transactions
2. Cypher query language for intuitive graph queries
3. High performance for connected data operations
4. Scalability for large datasets

Graph databases are particularly useful for:
- Social networks
- Recommendation engines
- Fraud detection
- Knowledge graphs
"""

# Add document to Neo4j
doc_id = rag.add_document(
    content=test_document,
    metadata={
        "source": "notebook_example",
        "category": "tutorial",
        "author": "Neo4j RAG System"
    }
)

print("✅ Document added successfully!")
print(f"📄 Document ID: {doc_id}")

# Check updated stats
new_stats = rag.get_stats()
print("\n📊 Updated stats:")
print(f"   - Documents: {stats['documents']} → {new_stats['documents']}")
print(f"   - Chunks: {stats['chunks']} → {new_stats['chunks']}")
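
The stats report chunks as well as documents, because a document is split into smaller pieces before it is embedded. How `add_document` splits text is not shown in this notebook, so the following is only a sketch of one common strategy: fixed-size character windows with overlap. The function `chunk_text` and its default sizes are hypothetical.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into character windows of `chunk_size` that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "Neo4j is a powerful graph database. " * 10
pieces = chunk_text(sample, chunk_size=100, overlap=20)
print(f"{len(sample)} characters -> {len(pieces)} chunks")
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of storing some text twice.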

## 4. Basic Vector Search

In [None]:
# Perform a vector search
query = "What are graph databases used for?"
print(f"🔍 Searching for: '{query}'\n")

results = rag.vector_search(query, k=3)

print(f"Found {len(results)} relevant chunks:\n")
for i, result in enumerate(results, 1):
    print(f"Result {i}:")
    print(f"  Score: {result['score']:.3f}")
    print(f"  Text: {result['text'][:150]}...")
    print(f"  Source: {result.get('metadata', {}).get('source', 'Unknown')}")
    print()
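
The internals of `rag.vector_search` are not shown here. As a rough illustration of the general idea behind vector search, the sketch below ranks toy vectors by cosine similarity to a query vector in plain Python. The function names and the 3-dimensional vectors are made up for the example; the actual system may use a different similarity measure or a database-side index.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=3):
    """Return (index, score) pairs for the k chunks most similar to the query."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings"; real sentence embeddings have hundreds of dimensions
chunks = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
for index, score in top_k([1.0, 0.1, 0.0], chunks, k=2):
    print(f"chunk {index}: {score:.3f}")
```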

## 5. Hybrid Search (Vector + Keyword)

In [None]:
# Hybrid search combines semantic and keyword matching
query = "ACID transactions Neo4j"
print(f"🔍 Hybrid search for: '{query}'\n")

results = rag.hybrid_search(query, k=3)

print(f"Found {len(results)} relevant chunks:\n")
for i, result in enumerate(results, 1):
    print(f"Result {i}:")
    print(f"  Combined Score: {result['score']:.3f}")
    print(f"  Text: {result['text'][:150]}...")
    print()
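
How `rag.hybrid_search` combines its two signals is not documented in this notebook. One simple possibility, sketched below under that caveat, is a weighted sum of a vector similarity score and a keyword-overlap score. Both functions and the `alpha` weight are hypothetical and only illustrate the idea of a combined score.

```python
def keyword_score(query, text):
    """Fraction of distinct query terms that appear in the text (case-insensitive)."""
    terms = set(query.lower().split())
    if not terms:
        return 0.0
    words = set(text.lower().split())
    return len(terms & words) / len(terms)


def hybrid_score(vector_score, kw_score, alpha=0.5):
    """Blend the two signals; alpha=1.0 is pure vector, alpha=0.0 is pure keyword."""
    return alpha * vector_score + (1 - alpha) * kw_score


text = "ACID compliance for reliable transactions"
kw = keyword_score("ACID transactions Neo4j", text)
print(f"keyword={kw:.3f} combined={hybrid_score(0.8, kw, alpha=0.7):.3f}")
```

Keyword matching helps with exact terms such as "ACID" that an embedding model may treat as just one feature among many, while the vector score still captures paraphrases.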

## 6. Load a PDF Document (Advanced)

In [None]:
from docling_loader import DoclingDocumentLoader
from pathlib import Path

# Initialize Docling loader
loader = DoclingDocumentLoader(neo4j_rag=rag)

# Check if we have a sample PDF
sample_pdf = Path("../samples/arxiv_rag_paper.pdf")

if sample_pdf.exists():
    print(f"📄 Loading PDF: {sample_pdf.name}\n")
    
    # Load PDF with advanced extraction
    doc_info = loader.load_document(
        str(sample_pdf),
        metadata={"category": "research", "source": "arxiv"}
    )
    
    print("✅ PDF loaded successfully!")
    print("\n📊 Extraction statistics:")
    print(f"   - Characters: {doc_info['statistics']['character_count']:,}")
    print(f"   - Tables: {doc_info['statistics']['table_count']}")
    print(f"   - Images: {doc_info['statistics']['image_count']}")
    print(f"   - Sections: {doc_info['statistics']['section_count']}")
else:
    print("ℹ️ No sample PDF found. Run test_docling_pdf.py to download samples.")
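
The cell above loads a single, known file. To load several PDFs, one option is to list them first and pass each path to `loader.load_document`. The sketch below covers only the discovery step; `find_pdfs` is a hypothetical helper, and the `../samples` directory is assumed from the path used above.

```python
from pathlib import Path


def find_pdfs(directory):
    """Return the PDF files directly inside `directory`, sorted by path."""
    folder = Path(directory)
    if not folder.is_dir():
        return []
    return sorted(p for p in folder.iterdir() if p.is_file() and p.suffix.lower() == ".pdf")


for pdf in find_pdfs("../samples"):
    print("Would load:", pdf.name)
    # doc_info = loader.load_document(str(pdf), metadata={"source": pdf.stem})
```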

## 7. Visualize Search Results

In [None]:
import matplotlib.pyplot as plt

# Search and visualize scores
query = "graph database"
results = rag.vector_search(query, k=10)

if results:
    scores = [r['score'] for r in results]
    indices = list(range(1, len(scores) + 1))
    
    plt.figure(figsize=(10, 6))
    plt.bar(indices, scores, color='steelblue')
    plt.xlabel('Result Rank')
    plt.ylabel('Similarity Score')
    plt.title(f'Search Results for: "{query}"')
    plt.grid(axis='y', alpha=0.3)
    
    # Add score values on bars
    for rank, score in zip(indices, scores):
        plt.text(rank, score + 0.01, f'{score:.3f}', ha='center')
    
    plt.show()
else:
    print("No results found for visualization")

## 8. Clean Up

In [None]:
# Close the connection
rag.close()
print("✅ Connection closed")

# Optional: clear the database (uncomment if needed).
# A new connection is opened because the one above is already closed.
# rag = Neo4jRAG()
# rag.clear_database()
# print("🗑️ Database cleared")
# rag.close()
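
If an earlier cell raises an error, the `rag.close()` call above may never run. In a script or a single cell, `contextlib.closing` guarantees that `close()` is called even when an exception occurs. The sketch below uses a hypothetical `DemoClient` class as a stand-in for any object with a `close()` method, such as `rag`.

```python
from contextlib import closing


class DemoClient:
    """Hypothetical stand-in for any object that exposes close(), such as `rag`."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


client = DemoClient()
try:
    with closing(client):
        raise RuntimeError("simulated failure mid-run")
except RuntimeError:
    pass

print("closed after error:", client.closed)  # prints: closed after error: True
```

A `with` block cannot span several notebook cells, so this pattern fits scripts better than a step-by-step notebook like this one.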

## Summary & Next Steps

You've successfully:
- ✅ Connected to Neo4j
- ✅ Loaded documents
- ✅ Performed vector and hybrid searches
- ✅ Visualized search results

### Next notebooks to explore:
1. **02_embeddings.ipynb** - Understanding embeddings and similarity
2. **03_document_processing.ipynb** - Advanced document extraction
3. **04_rag_pipeline.ipynb** - Building complete RAG systems
4. **05_optimization.ipynb** - Performance tuning

### Resources:
- [Neo4j Documentation](https://neo4j.com/docs/)
- [Project GitHub](https://github.com/yourusername/neo4j-rag-system)
- [RAG Tutorial](https://neo4j.com/blog/developer/rag-tutorial/)