A comprehensive guide to implementing GraphRAG (Graph-based Retrieval-Augmented Generation) with knowledge graphs, entity extraction, and graph-based retrieval using LangChain, ChromaDB, and Ollama.
GraphRAG is an advanced RAG approach that creates knowledge graphs from your documents and uses them for better retrieval and reasoning. Instead of just matching similar text chunks, GraphRAG understands relationships between entities and can perform complex reasoning.
- Knowledge Graph Construction: Automatically extracts entities and relationships from documents
- Multi-format Support: Loads PDF, Markdown (.md), text (.txt), and RTF (.rtf) files
- Entity Extraction: Uses LLMs to identify people, places, concepts, and events
- Relationship Mapping: Discovers connections between entities
- Graph-based Retrieval: Queries the knowledge graph for relevant information
- Hybrid Search: Combines graph traversal with vector similarity
- Multiple LLM Support: Works with Ollama, OpenAI, and other providers
- Interactive Graph Visualization: Visualize your knowledge graph
- Source Attribution: Shows which documents and relationships were used
Documents → Entity Extraction → Knowledge Graph → Graph Retrieval → LLM Response
     ↓              ↓                  ↓                ↓
 PDF/MD/TXT   People/Places      Nodes/Edges      Graph Queries
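The pipeline above can be sketched end to end in a few lines. This is a toy illustration with hardcoded extraction rules, not the repo's actual implementation (which uses spaCy and an LLM for extraction):

```python
# Toy GraphRAG pipeline: documents -> entities -> graph -> retrieval.
# Naive capitalization rules stand in for the real spaCy/LLM extraction.

def extract_entities(text):
    """Return capitalized words as naive 'entities'."""
    return [w.strip(".,") for w in text.split() if w[:1].isupper()]

def build_graph(docs):
    """Link entities that co-occur in the same document."""
    graph = {}
    for doc in docs:
        ents = extract_entities(doc)
        for e in ents:
            graph.setdefault(e, set()).update(x for x in ents if x != e)
    return graph

def retrieve(graph, entity):
    """Graph-based retrieval: return the entity's neighbors."""
    return sorted(graph.get(entity, set()))

docs = ["Microsoft was founded by Bill Gates.",
        "Bill Gates lives in Washington."]
graph = build_graph(docs)
print(retrieve(graph, "Gates"))  # ['Bill', 'Microsoft', 'Washington']
```

The real system adds entity typing, confidence scores, and vector search on top of this skeleton, but the document flow is the same.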
- Python 3.8+ (recommended: Python 3.10)
- Ollama installed
- Git
Prerequisites:
Base Installation (Required for both versions):
# Clone the repository
git clone https://github.com/gpalli/graphrag-tutorial-v1.git
cd graphrag-tutorial-v1
# Install Python version (if not already installed)
pyenv install 3.10.12
# Set local Python version
pyenv local 3.10.12
# Create virtual environment with pyenv
pyenv virtualenv 3.10.12 graphrag-tutorial-v1
# Activate virtual environment
pyenv activate graphrag-tutorial-v1
# Install base dependencies
pip install -r requirements.txt
# Download spaCy language models
python -m spacy download en_core_web_sm # English (default)
# For Spanish documents, also install:
# python -m spacy download es_core_news_sm
# Download Ollama models
ollama pull llama3.1:latest
ollama pull nomic-embed-text
Additional Setup for Optimized Version (Optional):
# Install optimization dependencies (only if using optimized version)
python install_optimized_deps.py
# Or install manually
pip install tqdm psutil memory-profiler
Note: The optimized version uses the same base installation as the standard version. The additional dependencies are only needed if you want to use the performance optimizations.
For Standard Version Only:
git clone https://github.com/gpalli/graphrag-tutorial-v1.git
cd graphrag-tutorial-v1
pyenv install 3.10.12
pyenv local 3.10.12
pyenv virtualenv 3.10.12 graphrag-tutorial-v1
pyenv activate graphrag-tutorial-v1
pip install -r requirements.txt
python -m spacy download en_core_web_sm
ollama pull llama3.1:latest
ollama pull nomic-embed-text
For Optimized Version:
git clone https://github.com/gpalli/graphrag-tutorial-v1.git
cd graphrag-tutorial-v1
pyenv install 3.10.12
pyenv local 3.10.12
pyenv virtualenv 3.10.12 graphrag-tutorial-v1
pyenv activate graphrag-tutorial-v1
pip install -r requirements.txt
python -m spacy download en_core_web_sm
python install_optimized_deps.py # Additional step
ollama pull llama3.1:latest
ollama pull nomic-embed-text
For Both Versions (Recommended): Use the optimized version setup above - it includes everything needed for both versions.
Benefits of using pyenv:
- 🐍 Multiple Python versions: Easy switching between Python versions
- 🔒 Isolated environments: Each project gets its own Python version
- 🚀 Fast switching: Quick activation/deactivation of environments
- 📦 No system conflicts: Doesn't interfere with system Python
- 🔄 Easy cleanup: Simple environment removal when done
Alternative: Standard venv (if you prefer)
# If you prefer standard venv instead of pyenv
python -m venv venv
source venv/bin/activate # On macOS/Linux
# or
venv\Scripts\activate # On Windows
Troubleshooting pyenv:
# Check pyenv installation
pyenv --version
# List available Python versions
pyenv install --list
# List installed Python versions
pyenv versions
# Deactivate environment
pyenv deactivate
# Remove environment
pyenv uninstall graphrag-tutorial-v1
For easy pyenv environment setup and cleanup, use the provided scripts:
Setup PyEnv Environment:
# Run the setup script
./setup_pyenv.sh
# This will:
# - Install Python 3.10.12 if not available
# - Create virtual environment 'graphrag-tutorial-v1'
# - Install all requirements
# - Download spaCy English model
# - Install optimized dependencies (if available)
Destroy/Cleanup PyEnv Environment:
# Run the destroy script
./destroy_pyenv.sh
# Choose from options:
# 1. Remove only the virtual environment
# 2. Remove the Python version and all its virtual environments
# 3. Remove all Python versions and virtual environments
# 4. Complete pyenv removal (nuclear option)
# 5. Cancel
Manual PyEnv Commands:
# Activate environment
pyenv activate graphrag-tutorial-v1
# Deactivate environment
pyenv deactivate
# Check current status
pyenv versions
pyenv virtualenvs
GraphRAG supports multiple languages through spaCy models. Here are the available language models:
English (Default):
python -m spacy download en_core_web_sm
Spanish:
python -m spacy download es_core_news_sm # Small model (recommended)
python -m spacy download es_core_news_md # Medium model (with word vectors)
python -m spacy download es_core_news_lg # Large model (detailed word vectors)
Other Languages:
# German
python -m spacy download de_core_news_sm
# French
python -m spacy download fr_core_news_sm
# Italian
python -m spacy download it_core_news_sm
# Portuguese
python -m spacy download pt_core_news_sm
# Dutch
python -m spacy download nl_core_news_sm
# Chinese
python -m spacy download zh_core_web_sm
# Japanese
python -m spacy download ja_core_news_sm
Using Different Languages: The system will automatically detect the language of your documents and use the appropriate spaCy model if available. For best results with Spanish documents, install the Spanish model:
# Install Spanish model
python -m spacy download es_core_news_sm
# Your Spanish documents will now be processed with better accuracy
python build_knowledge_graph.py --reset
Example with Spanish Documents:
# Install Spanish spaCy model
python -m spacy download es_core_news_sm
# Test entity extraction in Spanish
python entity_extractor.py --text "El presidente de Microsoft, Satya Nadella, anunció nuevas inversiones en inteligencia artificial en la sede de la empresa en Redmond, Washington." --language es
# Build knowledge graph with Spanish documents
python build_knowledge_graph.py --reset
Language Detection:
The system automatically detects the language of your documents when using --language auto (default). For better performance, you can specify the language explicitly:
# Force English processing
python build_knowledge_graph.py --language en
# Force Spanish processing
python build_knowledge_graph.py --language es
# Auto-detect language (default)
python build_knowledge_graph.py --language auto
Test Spanish Language Support:
# Test Spanish entity extraction
python test_spanish.py
# This will test:
# - Language detection
# - Entity extraction with different language models
# - Comparison between auto, Spanish, and English models
# Extract entities and build knowledge graph
python build_knowledge_graph.py --reset
# Query the GraphRAG system
python query_graph.py "What is the relationship between X and Y?"
# Visualize the knowledge graph
python visualize_graph.py
The optimized version provides significant speed improvements through parallel processing, caching, and memory optimization.
# Install additional optimization dependencies
python install_optimized_deps.py
# Build knowledge graph with optimizations (3-8x faster)
python optimized_build_knowledge_graph.py --reset
# Use specific number of workers for your system
python optimized_build_knowledge_graph.py --reset --max-workers 4
# Resume from previous progress if interrupted
python optimized_build_knowledge_graph.py --resume
# Query the GraphRAG system (same as standard)
python query_graph.py "What is the relationship between X and Y?"
# Compare performance between versions
python performance_comparison.py
Both standard and optimized versions now support incremental updates by default, only processing new or modified documents.
# First build (processes all documents)
python build_knowledge_graph.py --reset
# or
python optimized_build_knowledge_graph.py --reset
# Add new documents to data/ folder, then run (only processes new/modified files):
python build_knowledge_graph.py
# or
python optimized_build_knowledge_graph.py
# Force rebuild all documents (if needed)
python build_knowledge_graph.py --force-rebuild
# or
python optimized_build_knowledge_graph.py --force-rebuild
# Query the GraphRAG system (same as other versions)
python query_graph.py "What is the relationship between X and Y?"
First, you need to create the knowledge graph from your documents:
# Standard version
python build_knowledge_graph.py --reset
# Optimized version (recommended)
python optimized_build_knowledge_graph.py --reset
Once the knowledge graph is built, you can query it:
# Interactive query mode
python query_graph.py
# Direct query
python query_graph.py "What entities are related to security?"
# Complex relationship queries
python query_graph.py "How does X influence Y through Z?"
View your knowledge graph interactively:
# Launch interactive visualization
python visualize_graph.py
# Save graph as image
python visualize_graph.py --save-image graph.png
# Export graph data
python visualize_graph.py --export-json graph_data.json
Check what was extracted from your documents:
# View graph statistics
python -c "
from graph_builder import KnowledgeGraph
kg = KnowledgeGraph()
kg.load_graph('knowledge_graph.pkl')
stats = kg.get_graph_statistics()
print(f'Entities: {stats[\"total_entities\"]}')
print(f'Relationships: {stats[\"total_relationships\"]}')
print(f'Entity types: {list(stats[\"entity_types\"].keys())}')
"
Find all entities of a specific type:
python query_graph.py "Show me all people mentioned in the documents"
Discover relationships:
python query_graph.py "What is the relationship between Microsoft and Bill Gates?"
Complex reasoning:
python query_graph.py "How does technology influence business processes?"
Entity connections:
python query_graph.py "What entities are connected to security concepts?"
Loading documents from data
Loaded 25 PDF documents
Loaded 15 Markdown documents
Split documents into 150 chunks
Extracting entities: 100%|████████| 150/150 [02:30<00:00, 1.00it/s]
Extracted 1250 entities and 890 relationships
Building knowledge graph...
Knowledge graph built in 15.23 seconds
KNOWLEDGE GRAPH SUMMARY
============================================================
Total entities: 1250
Total relationships: 890
Graph density: 0.0234
Connected components: 45
Entity types:
PERSON: 156
ORG: 89
CONCEPT: 445
LOCATION: 67
EVENT: 123
Relationship types:
works_for: 234
located_in: 156
related_to: 445
influences: 89
Query: "What entities are related to security?"
Found 23 entities related to security:
- Security (CONCEPT) - confidence: 0.95
- Cybersecurity (CONCEPT) - confidence: 0.92
- Authentication (CONCEPT) - confidence: 0.88
- Firewall (OBJECT) - confidence: 0.85
- Encryption (CONCEPT) - confidence: 0.83
Relationships found:
- Security --related_to--> Cybersecurity
- Security --influences--> Authentication
- Firewall --part_of--> Security
- Encryption --enables--> Security
Sources: document1.pdf, document3.pdf, security_guide.md
- Interactive graph opens in your browser
- Zoom and pan to explore the network
- Click entities to see details
- Hover over edges to see relationships
- Filter by entity type or confidence
| Dataset Size | Standard Time | Optimized Time | Speedup |
|---|---|---|---|
| Small (10 docs) | 2-5 minutes | 30-60 seconds | 3-5x faster |
| Medium (50 docs) | 10-20 minutes | 2-5 minutes | 4-6x faster |
| Large (100+ docs) | 30-60 minutes | 5-15 minutes | 4-8x faster |
Use Standard Version when:
- Working with small datasets (< 20 documents)
- Learning GraphRAG concepts
- Limited system resources
- Simple, one-time processing
Use Optimized Version when:
- Working with large datasets (20+ documents)
- Processing documents repeatedly
- Need faster processing times
- Have multi-core CPU and sufficient RAM
- Want progress tracking and resume capability
Key Optimized Features:
- 🚀 Parallel Processing: Uses all CPU cores
- 💾 Document Caching: Avoids re-processing unchanged files
- 📊 Progress Tracking: Real-time progress bars
- 🔄 Resume Capability: Continue from where you left off
- ⚡ Memory Optimization: Efficient batch processing
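The parallel-processing idea can be sketched with `concurrent.futures` from the standard library. The worker function and chunk format below are illustrative assumptions, not the optimized script's actual code (which distributes chunks across CPU cores):

```python
from concurrent.futures import ThreadPoolExecutor

def extract(chunk):
    """Stand-in for per-chunk entity extraction (toy rule)."""
    return [w.strip(".,") for w in chunk.split() if w[:1].isupper()]

def extract_all(chunks, max_workers=4):
    """Fan chunks out to a worker pool; map() preserves input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(extract, chunks))

chunks = ["Ollama runs locally.", "LangChain wraps the LLM."]
print(extract_all(chunks))  # [['Ollama'], ['LangChain', 'LLM']]
```

For CPU-bound extraction the real tool would use process-based workers (hence the `--max-workers` flag); a thread pool is shown here only to keep the sketch self-contained.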
Key Incremental Features:
- 🔄 True Incremental Updates: Only processes new/modified documents
- 📁 File Change Detection: Uses file hashes to detect changes
- 🔗 Knowledge Graph Merging: Merges new entities with existing graph
- ⚡ Fast Updates: Subsequent runs are much faster
- 📊 State Tracking: Tracks processed files and build state
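Knowledge-graph merging can be pictured as a union of adjacency sets: new entities and edges are folded into the existing graph without disturbing what is already there. A minimal sketch (the real merge also reconciles entity metadata and confidences, which is omitted here):

```python
def merge_graphs(existing, new):
    """Fold new entities/edges into a copy of the existing adjacency map."""
    merged = {e: set(nbrs) for e, nbrs in existing.items()}
    for entity, neighbors in new.items():
        merged.setdefault(entity, set()).update(neighbors)
    return merged

old = {"Security": {"Encryption"}}
delta = {"Security": {"Firewall"}, "Firewall": {"Security"}}
print(merge_graphs(old, delta))
```

Because the merge only ever adds nodes and edges, re-running an incremental build is safe: unchanged documents contribute nothing new.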
graphrag-tutorial-v1/
├── data/ # Place your documents here
├── knowledge_graph/ # Generated knowledge graph data
├── cache/ # Document cache (optimized version)
├── build_knowledge_graph.py # Standard knowledge graph builder (with incremental support)
├── optimized_build_knowledge_graph.py # Optimized knowledge graph builder (with incremental support)
├── query_graph.py # Query the GraphRAG system
├── visualize_graph.py # Visualize the knowledge graph
├── entity_extractor.py # Entity extraction utilities
├── graph_builder.py # Knowledge graph construction
├── graph_retriever.py # Graph-based retrieval
├── performance_comparison.py # Performance benchmarking tool
├── install_optimized_deps.py # Install optimization dependencies
├── config.yaml # Standard configuration
├── optimized_config.yaml # Optimized configuration
├── OPTIMIZATION_GUIDE.md # Detailed optimization guide
└── requirements.txt # Python dependencies
The system uses YAML configuration files to manage all settings. There are two configuration files:
- config.yaml - Standard configuration for regular usage
- optimized_config.yaml - Optimized configuration for better performance
Standard Configuration (config.yaml):
llm:
  provider: "ollama"
  model: "llama3.1:latest"
  temperature: 0.1
  max_tokens: 2048
embedding:
  provider: "ollama"
  model: "nomic-embed-text"
graph:
  storage: "networkx"
  max_entities: 1000
  min_relationship_confidence: 0.7
entity_extraction:
  use_spacy: true
  use_llm: true
  min_entity_confidence: 0.7
Optimized Configuration (optimized_config.yaml):
llm:
  provider: "ollama"
  model: "llama3.1:latest"
  temperature: 0.1
  max_tokens: 2048
performance:
  enable_progress_bars: true
  save_intermediate_results: true
parallel:
  max_workers: null # auto-detect
  entity_extraction_parallel: true
cache:
  enabled: true
  cache_dir: "cache"
  max_cache_size: "1GB"
How the system loads configuration:
- Primary source: YAML configuration files (config.yaml or optimized_config.yaml)
- Fallback defaults: Built-in defaults if config files are missing
- Command-line override: Arguments can override any config value
Configuration precedence (highest to lowest):
- Command-line arguments (e.g., --llm-model)
- YAML configuration file values
- Built-in defaults
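The precedence rules amount to successive dictionary merges, defaults first and command line last. A sketch with illustrative keys (the repo's actual config loader may differ):

```python
def resolve_config(defaults, yaml_cfg, cli_args):
    """Later sources win: built-in defaults < YAML file < command line.
    CLI arguments left unset (None) do not override anything."""
    cfg = dict(defaults)
    cfg.update(yaml_cfg)
    cfg.update({k: v for k, v in cli_args.items() if v is not None})
    return cfg

defaults = {"llm_model": "llama3.1:latest", "chunk_size": 1000}
yaml_cfg = {"chunk_size": 800}                      # from config.yaml
cli_args = {"llm_model": "llama3.1:7b", "chunk_size": None}
print(resolve_config(defaults, yaml_cfg, cli_args))
# {'llm_model': 'llama3.1:7b', 'chunk_size': 800}
```

Filtering out `None` values is what lets an omitted CLI flag fall through to the YAML or default value.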
Examples:
# Use standard configuration (loads config.yaml)
python build_knowledge_graph.py
# Use optimized configuration (loads optimized_config.yaml)
python optimized_build_knowledge_graph.py
# Override specific values from command line
python build_knowledge_graph.py --llm-model llama3.1:7b --chunk-size 500
# Override provider and model
python query_graph.py "Your question" --llm-provider openai --llm-model gpt-4
Environment Variables (Optional):
You can also set environment variables to override configuration:
# Set environment variables
export LLM_PROVIDER=ollama
export LLM_MODEL=llama3.1:latest
# Run without specifying model
python build_knowledge_graph.py
- People: Names, roles, organizations
- Places: Locations, addresses, regions
- Concepts: Ideas, topics, themes
- Events: Actions, occurrences, processes
- Objects: Products, tools, resources
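As a rough picture of how extracted mentions map onto these types, here is a toy lookup-based tagger. The names and the `CONCEPT` fallback are illustrative; the actual pipeline derives types from spaCy NER labels refined by an LLM pass:

```python
# Toy entity typing; stands in for spaCy NER (PERSON, ORG, GPE, ...)
# plus the LLM refinement step used by the real extractor.
KNOWN_TYPES = {
    "Satya Nadella": "PERSON",
    "Redmond": "LOCATION",
    "Microsoft": "ORG",
}

def type_entities(mentions):
    """Tag each mention; unknown mentions default to CONCEPT."""
    return [(m, KNOWN_TYPES.get(m, "CONCEPT")) for m in mentions]

print(type_entities(["Satya Nadella", "Microsoft", "Encryption"]))
```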
- Semantic: "is_a", "part_of", "related_to"
- Temporal: "before", "after", "during"
- Causal: "causes", "influences", "leads_to"
- Spatial: "located_in", "near", "contains"
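Relationship extraction of this kind can be approximated with surface patterns. A minimal regex sketch, purely illustrative — the actual extractor relies on the LLM rather than regexes:

```python
import re

# Map simple surface patterns to relationship types (illustrative only).
PATTERNS = [
    (re.compile(r"(\w+) works for (\w+)"), "works_for"),
    (re.compile(r"(\w+) is located in (\w+)"), "located_in"),
    (re.compile(r"(\w+) causes (\w+)"), "causes"),
]

def extract_relations(text):
    """Return (subject, relation, object) triples found in the text."""
    triples = []
    for pattern, rel in PATTERNS:
        for a, b in pattern.findall(text):
            triples.append((a, rel, b))
    return triples

print(extract_relations("Alice works for Acme. Acme is located in Redmond."))
# [('Alice', 'works_for', 'Acme'), ('Acme', 'located_in', 'Redmond')]
```

The LLM-based approach handles paraphrase ("is employed by", "joined") that brittle patterns like these miss, which is why the config exposes a `min_relationship_confidence` threshold instead.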
# Find all entities related to "security"
query_graph("What entities are related to security?")
# Find relationships between two entities
query_graph("What is the relationship between X and Y?")
# Multi-hop reasoning through the graph
query_graph("How does X influence Y through Z?")
- Research: Connect concepts across multiple papers
- Documentation: Understand relationships in technical docs
- Knowledge Management: Build organizational knowledge graphs
- Question Answering: Complex reasoning over multiple sources
- Content Analysis: Discover hidden connections in text
- Centrality Analysis: Find most important entities
- Community Detection: Group related entities
- Path Finding: Discover connection paths
- Graph Metrics: Analyze graph structure
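Path finding is what powers multi-hop questions like "How does X influence Y through Z?": the question reduces to a search for an entity chain. A breadth-first search over an adjacency map, which finds the shortest such chain:

```python
from collections import deque

def find_path(graph, start, goal):
    """Shortest entity chain from start to goal (BFS), or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

graph = {"X": ["Z"], "Z": ["Y"], "Y": []}
print(find_path(graph, "X", "Y"))  # ['X', 'Z', 'Y']
```

NetworkX provides the same operation (`shortest_path`) on the real graph; the sketch just shows why an intermediate entity like Z appears in the answer.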
- Interactive Graphs: Explore with NetworkX/Plotly
- Entity Networks: Focus on specific entity types
- Relationship Maps: Visualize connection patterns
- Timeline Views: Show temporal relationships
# Adjust chunk size for better performance
python build_knowledge_graph.py --chunk-size 500 --chunk-overlap 100
# Override model from command line (overrides config.yaml)
python build_knowledge_graph.py --llm-model llama3.1:7b --llm-provider ollama
# Find optimal worker count for your system
python performance_comparison.py --benchmark-workers
# Use optimized configuration (automatically loads optimized_config.yaml)
python optimized_build_knowledge_graph.py
# Monitor resource usage
htop # or Activity Monitor on macOS
# Reduce memory usage for large datasets
python optimized_build_knowledge_graph.py --max-workers 2 --chunk-size 500
# Enable memory mapping for very large datasets
# Edit optimized_config.yaml:
# memory:
#   enable_memory_mapping: true
- First run: Processes all documents (slower)
- Subsequent runs: Only processes changed documents (much faster)
- Cache location: cache/ directory (automatically managed)
- Cache cleanup: Use --reset to clear cache
- First run: Processes all documents and builds knowledge graph
- Subsequent runs: Only processes new/modified documents
- Knowledge graph merging: Adds new entities to existing graph
- State tracking: Remembers which files have been processed
- File change detection: Uses MD5 hashes to detect modifications
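The change detection can be sketched with `hashlib`: hash each file, compare against the stored build state, and reprocess only mismatches. Function names here are illustrative, not the repo's API:

```python
import hashlib
import os
import tempfile

def file_hash(path):
    """MD5 of the file contents."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def changed_files(paths, state):
    """Return paths that are new or modified since the last build,
    updating the stored hash state in place."""
    stale = []
    for p in paths:
        h = file_hash(p)
        if state.get(p) != h:
            stale.append(p)
            state[p] = h
    return stale

# Demo: a new file is flagged once, then skipped while unchanged.
with tempfile.TemporaryDirectory() as d:
    doc = os.path.join(d, "doc.txt")
    with open(doc, "w") as f:
        f.write("hello")
    state = {}
    print(changed_files([doc], state))  # flagged: new file
    print(changed_files([doc], state))  # skipped: unchanged
```

Persisting `state` (e.g. as JSON alongside the graph) is what makes subsequent runs fast: only the returned paths go back through entity extraction.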
# 1. Initial build (processes all documents)
python incremental_build_knowledge_graph.py --reset
# 2. Add new documents to data/ folder
cp new_document.pdf data/
# 3. Update knowledge graph (only processes new document)
python incremental_build_knowledge_graph.py
# 4. Modify existing document
echo "New content" >> data/existing_document.txt
# 5. Update knowledge graph (only processes modified document)
python incremental_build_knowledge_graph.py
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
- Microsoft Research for GraphRAG concept
- LangChain for the framework
- Ollama for local LLM support
- NetworkX for graph operations