GraphRAG Tutorial v1

A comprehensive guide to implementing GraphRAG (Graph-based Retrieval-Augmented Generation) with knowledge graphs, entity extraction, and graph-based retrieval using LangChain, ChromaDB, and Ollama.

🕸️ What is GraphRAG?

GraphRAG is an advanced RAG approach that creates knowledge graphs from your documents and uses them for better retrieval and reasoning. Instead of just matching similar text chunks, GraphRAG understands relationships between entities and can perform complex reasoning.

✨ Features

  • Knowledge Graph Construction: Automatically extracts entities and relationships from documents
  • Multi-format Support: Loads PDF, Markdown (.md), text (.txt), and RTF (.rtf) files
  • Entity Extraction: Uses LLMs to identify people, places, concepts, and events
  • Relationship Mapping: Discovers connections between entities
  • Graph-based Retrieval: Queries the knowledge graph for relevant information
  • Hybrid Search: Combines graph traversal with vector similarity
  • Multiple LLM Support: Works with Ollama, OpenAI, and other providers
  • Interactive Graph Visualization: Visualize your knowledge graph
  • Source Attribution: Shows which documents and relationships were used

🏗️ Architecture

Documents → Entity Extraction → Knowledge Graph → Graph Retrieval → LLM Response
    ↓              ↓                  ↓                 ↓
PDF/MD/TXT    People/Places      Nodes/Edges      Graph Queries
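
The stages above can be sketched as a minimal pipeline. This is illustrative only: the function names and the fixed entity lookup are hypothetical stand-ins, not the repository's actual API (the real system uses spaCy/LLM extraction and vector search on top of the graph).

```python
def extract_entities(chunk):
    """Return (entity, type) pairs found in the chunk. A real implementation
    would use spaCy and/or an LLM instead of this fixed lookup."""
    known = {"Microsoft": "ORG", "Satya Nadella": "PERSON", "Redmond": "LOCATION"}
    return [(name, label) for name, label in known.items() if name in chunk]

def build_graph(chunks):
    """Nodes are entities; co-occurrence within a chunk becomes an edge."""
    nodes, edges = {}, set()
    for chunk in chunks:
        ents = extract_entities(chunk)
        for name, label in ents:
            nodes[name] = label
        for i, (a, _) in enumerate(ents):
            for b, _ in ents[i + 1:]:
                edges.add((a, "related_to", b))
    return nodes, edges

def retrieve(nodes, edges, query):
    """Graph retrieval: entities named in the query plus one-hop neighbours."""
    hits = {n for n in nodes if n.lower() in query.lower()}
    neighbours = set()
    for a, _, b in edges:
        if a in hits:
            neighbours.add(b)
        if b in hits:
            neighbours.add(a)
    return hits | neighbours

chunks = ["Satya Nadella leads Microsoft.",
          "Microsoft is headquartered in Redmond."]
nodes, edges = build_graph(chunks)
print(retrieve(nodes, edges, "Tell me about Microsoft"))
```

Note how the query for "Microsoft" pulls in Satya Nadella and Redmond through graph edges, even though neither appears in the query text; that is the core difference from chunk-similarity RAG.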

🚀 Quick Start

1. Prerequisites

  • Python 3.8+ (recommended: Python 3.10)
  • pyenv installed
  • Ollama installed
  • Git

2. Installation

Base Installation (Required for both versions):

# Clone the repository
git clone https://github.com/gpalli/graphrag-tutorial-v1.git
cd graphrag-tutorial-v1

# Install Python version (if not already installed)
pyenv install 3.10.12

# Set local Python version
pyenv local 3.10.12

# Create virtual environment with pyenv
pyenv virtualenv 3.10.12 graphrag-tutorial-v1

# Activate virtual environment
pyenv activate graphrag-tutorial-v1

# Install base dependencies
pip install -r requirements.txt

# Download spaCy language models
python -m spacy download en_core_web_sm  # English (default)
# For Spanish documents, also install:
# python -m spacy download es_core_news_sm

# Download Ollama models
ollama pull llama3.1:latest
ollama pull nomic-embed-text

Additional Setup for Optimized Version (Optional):

# Install optimization dependencies (only if using optimized version)
python install_optimized_deps.py

# Or install manually
pip install tqdm psutil memory-profiler

Note: The optimized version uses the same base installation as the standard version. The additional dependencies are only needed if you want to use the performance optimizations.

Quick Setup Reference

For Standard Version Only:

git clone https://github.com/gpalli/graphrag-tutorial-v1.git
cd graphrag-tutorial-v1
pyenv install 3.10.12
pyenv local 3.10.12
pyenv virtualenv 3.10.12 graphrag-tutorial-v1
pyenv activate graphrag-tutorial-v1
pip install -r requirements.txt
python -m spacy download en_core_web_sm
ollama pull llama3.1:latest
ollama pull nomic-embed-text

For Optimized Version:

git clone https://github.com/gpalli/graphrag-tutorial-v1.git
cd graphrag-tutorial-v1
pyenv install 3.10.12
pyenv local 3.10.12
pyenv virtualenv 3.10.12 graphrag-tutorial-v1
pyenv activate graphrag-tutorial-v1
pip install -r requirements.txt
python -m spacy download en_core_web_sm
python install_optimized_deps.py  # Additional step
ollama pull llama3.1:latest
ollama pull nomic-embed-text

For Both Versions (Recommended): Use the optimized version setup above - it includes everything needed for both versions.

Why pyenv?

Benefits of using pyenv:

  • 🐍 Multiple Python versions: Easy switching between Python versions
  • 🔒 Isolated environments: Each project gets its own Python version
  • 🚀 Fast switching: Quick activation/deactivation of environments
  • 📦 No system conflicts: Doesn't interfere with system Python
  • 🔄 Easy cleanup: Simple environment removal when done

Alternative: Standard venv (if you prefer)

# If you prefer standard venv instead of pyenv
python -m venv venv
source venv/bin/activate  # On macOS/Linux
# or
venv\Scripts\activate     # On Windows

Troubleshooting pyenv:

# Check pyenv installation
pyenv --version

# List available Python versions
pyenv install --list

# List installed Python versions
pyenv versions

# Deactivate environment
pyenv deactivate

# Remove environment
pyenv uninstall graphrag-tutorial-v1

PyEnv Environment Management

For easy pyenv environment setup and cleanup, use the provided scripts:

Setup PyEnv Environment:

# Run the setup script
./setup_pyenv.sh

# This will:
# - Install Python 3.10.12 if not available
# - Create virtual environment 'graphrag-tutorial-v1'
# - Install all requirements
# - Download spaCy English model
# - Install optimized dependencies (if available)

Destroy/Cleanup PyEnv Environment:

# Run the destroy script
./destroy_pyenv.sh

# Choose from options:
# 1. Remove only the virtual environment
# 2. Remove the Python version and all its virtual environments
# 3. Remove all Python versions and virtual environments
# 4. Complete pyenv removal (nuclear option)
# 5. Cancel

Manual PyEnv Commands:

# Activate environment
pyenv activate graphrag-tutorial-v1

# Deactivate environment
pyenv deactivate

# Check current status
pyenv versions
pyenv virtualenvs

Multi-Language Support

GraphRAG supports multiple languages through spaCy models. Here are the available language models:

English (Default):

python -m spacy download en_core_web_sm

Spanish:

python -m spacy download es_core_news_sm    # Small model (recommended)
python -m spacy download es_core_news_md    # Medium model (with word vectors)
python -m spacy download es_core_news_lg    # Large model (detailed word vectors)

Other Languages:

# German
python -m spacy download de_core_news_sm

# French
python -m spacy download fr_core_news_sm

# Italian
python -m spacy download it_core_news_sm

# Portuguese
python -m spacy download pt_core_news_sm

# Dutch
python -m spacy download nl_core_news_sm

# Chinese
python -m spacy download zh_core_web_sm

# Japanese
python -m spacy download ja_core_news_sm

Using Different Languages: The system will automatically detect the language of your documents and use the appropriate spaCy model if available. For best results with Spanish documents, install the Spanish model:

# Install Spanish model
python -m spacy download es_core_news_sm

# Your Spanish documents will now be processed with better accuracy
python build_knowledge_graph.py --reset

Example with Spanish Documents:

# Install Spanish spaCy model
python -m spacy download es_core_news_sm

# Test entity extraction in Spanish
python entity_extractor.py --text "El presidente de Microsoft, Satya Nadella, anunció nuevas inversiones en inteligencia artificial en la sede de la empresa en Redmond, Washington." --language es

# Build knowledge graph with Spanish documents
python build_knowledge_graph.py --reset

Language Detection: The system automatically detects the language of your documents when using --language auto (default). For better performance, you can specify the language explicitly:

# Force English processing
python build_knowledge_graph.py --language en

# Force Spanish processing  
python build_knowledge_graph.py --language es

# Auto-detect language (default)
python build_knowledge_graph.py --language auto
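
Conceptually, `--language auto` maps a detected language code to the matching spaCy model name, falling back to English when no model is installed for that language. The sketch below shows that dispatch; the model names are real spaCy packages, but the `detect()` stub is a toy stand-in for a proper detector (e.g. langdetect), and the function names are hypothetical, not the repository's internals.

```python
# Map ISO language codes to spaCy model names (real spaCy package names).
SPACY_MODELS = {
    "en": "en_core_web_sm",
    "es": "es_core_news_sm",
    "de": "de_core_news_sm",
    "fr": "fr_core_news_sm",
}

def detect(text):
    # Toy stand-in for a real language detector such as langdetect.
    markers = ("el ", "la ", "ó")
    return "es" if any(m in text.lower() for m in markers) else "en"

def pick_model(text, language="auto"):
    # "auto" detects the language; otherwise use the explicit code.
    code = detect(text) if language == "auto" else language
    # Fall back to the English model if no model exists for the code.
    return SPACY_MODELS.get(code, SPACY_MODELS["en"])

print(pick_model("El presidente anunció nuevas inversiones.", "auto"))
print(pick_model("The president announced new investments.", "auto"))
```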

Test Spanish Language Support:

# Test Spanish entity extraction
python test_spanish.py

# This will test:
# - Language detection
# - Entity extraction with different language models
# - Comparison between auto, Spanish, and English models

3. Basic Usage

Standard Version (Recommended for small datasets)

# Extract entities and build knowledge graph
python build_knowledge_graph.py --reset

# Query the GraphRAG system
python query_graph.py "What is the relationship between X and Y?"

# Visualize the knowledge graph
python visualize_graph.py

Optimized Version (Recommended for large datasets)

The optimized version provides significant speed improvements through parallel processing, caching, and memory optimization.

# Install additional optimization dependencies
python install_optimized_deps.py

# Build knowledge graph with optimizations (3-8x faster)
python optimized_build_knowledge_graph.py --reset

# Use specific number of workers for your system
python optimized_build_knowledge_graph.py --reset --max-workers 4

# Resume from previous progress if interrupted
python optimized_build_knowledge_graph.py --resume

# Query the GraphRAG system (same as standard)
python query_graph.py "What is the relationship between X and Y?"

# Compare performance between versions
python performance_comparison.py

Incremental Updates (Built into both versions)

Both standard and optimized versions now support incremental updates by default, only processing new or modified documents.

# First build (processes all documents)
python build_knowledge_graph.py --reset
# or
python optimized_build_knowledge_graph.py --reset

# Add new documents to data/ folder, then run (only processes new/modified files):
python build_knowledge_graph.py
# or
python optimized_build_knowledge_graph.py

# Force rebuild all documents (if needed)
python build_knowledge_graph.py --force-rebuild
# or
python optimized_build_knowledge_graph.py --force-rebuild

# Query the GraphRAG system (same as other versions)
python query_graph.py "What is the relationship between X and Y?"

🖥️ Viewing and Exploring Your GraphRAG

1. Build Your Knowledge Graph

First, you need to create the knowledge graph from your documents:

# Standard version
python build_knowledge_graph.py --reset

# Optimized version (recommended)
python optimized_build_knowledge_graph.py --reset

2. Query Your GraphRAG System

Once the knowledge graph is built, you can query it:

# Interactive query mode
python query_graph.py

# Direct query
python query_graph.py "What entities are related to security?"

# Complex relationship queries
python query_graph.py "How does X influence Y through Z?"

3. Visualize Your Knowledge Graph

View your knowledge graph interactively:

# Launch interactive visualization
python visualize_graph.py

# Save graph as image
python visualize_graph.py --save-image graph.png

# Export graph data
python visualize_graph.py --export-json graph_data.json

4. Explore Graph Statistics

Check what was extracted from your documents:

# View graph statistics
python -c "
from graph_builder import KnowledgeGraph
kg = KnowledgeGraph()
kg.load_graph('knowledge_graph.pkl')
stats = kg.get_graph_statistics()
print(f'Entities: {stats[\"total_entities\"]}')
print(f'Relationships: {stats[\"total_relationships\"]}')
print(f'Entity types: {list(stats[\"entity_types\"].keys())}')
"

5. Interactive Exploration Examples

Find all entities of a specific type:

python query_graph.py "Show me all people mentioned in the documents"

Discover relationships:

python query_graph.py "What is the relationship between Microsoft and Bill Gates?"

Complex reasoning:

python query_graph.py "How does technology influence business processes?"

Entity connections:

python query_graph.py "What entities are connected to security concepts?"

6. What You'll See When Running GraphRAG

During Knowledge Graph Building:

Loading documents from data
Loaded 25 PDF documents
Loaded 15 Markdown documents
Split documents into 150 chunks
Extracting entities: 100%|████████| 150/150 [02:30<00:00, 1.00it/s]
Extracted 1250 entities and 890 relationships
Building knowledge graph...
Knowledge graph built in 15.23 seconds

KNOWLEDGE GRAPH SUMMARY
============================================================
Total entities: 1250
Total relationships: 890
Graph density: 0.0234
Connected components: 45

Entity types:
  PERSON: 156
  ORG: 89
  CONCEPT: 445
  LOCATION: 67
  EVENT: 123

Relationship types:
  works_for: 234
  located_in: 156
  related_to: 445
  influences: 89

When Querying:

Query: "What entities are related to security?"

Found 23 entities related to security:
- Security (CONCEPT) - confidence: 0.95
- Cybersecurity (CONCEPT) - confidence: 0.92
- Authentication (CONCEPT) - confidence: 0.88
- Firewall (OBJECT) - confidence: 0.85
- Encryption (CONCEPT) - confidence: 0.83

Relationships found:
- Security --related_to--> Cybersecurity
- Security --influences--> Authentication
- Firewall --part_of--> Security
- Encryption --enables--> Security

Sources: document1.pdf, document3.pdf, security_guide.md

When Visualizing:

  • Interactive graph opens in your browser
  • Zoom and pan to explore the network
  • Click entities to see details
  • Hover over edges to see relationships
  • Filter by entity type or confidence

Performance Comparison

| Dataset Size      | Standard Time | Optimized Time | Speedup     |
|-------------------|---------------|----------------|-------------|
| Small (10 docs)   | 2-5 minutes   | 30-60 seconds  | 3-5x faster |
| Medium (50 docs)  | 10-20 minutes | 2-5 minutes    | 4-6x faster |
| Large (100+ docs) | 30-60 minutes | 5-15 minutes   | 4-8x faster |

Which Version Should I Use?

Use Standard Version when:

  • Working with small datasets (< 20 documents)
  • Learning GraphRAG concepts
  • Limited system resources
  • Simple, one-time processing

Use Optimized Version when:

  • Working with large datasets (20+ documents)
  • Processing documents repeatedly
  • Need faster processing times
  • Have multi-core CPU and sufficient RAM
  • Want progress tracking and resume capability

Key Optimized Features:

  • 🚀 Parallel Processing: Uses all CPU cores
  • 💾 Document Caching: Avoids re-processing unchanged files
  • 📊 Progress Tracking: Real-time progress bars
  • 🔄 Resume Capability: Continue from where you left off
  • Memory Optimization: Efficient batch processing
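
The parallel-processing idea can be sketched as fanning chunks out across workers and merging the per-chunk results in order. This is a simplified illustration, not the repository's code: `extract()` is a stand-in for the real spaCy/LLM extractor, and a real implementation would likely use processes rather than threads for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor

def extract(chunk):
    # Stand-in extractor: treat capitalized words as entities.
    return [w.strip(".,") for w in chunk.split() if w[:1].isupper()]

def extract_all(chunks, max_workers=None):
    # Fan chunks out to workers; map() preserves input order when merging.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_chunk = list(pool.map(extract, chunks))
    return [entity for entities in per_chunk for entity in entities]

print(extract_all(["Alice met Bob in Paris.", "Bob works at Acme."], max_workers=2))
```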

Key Incremental Features:

  • 🔄 True Incremental Updates: Only processes new/modified documents
  • 📁 File Change Detection: Uses file hashes to detect changes
  • 🔗 Knowledge Graph Merging: Merges new entities with existing graph
  • Fast Updates: Subsequent runs are much faster
  • 📊 State Tracking: Tracks processed files and build state

📁 Project Structure

graphrag-tutorial-v1/
├── data/                           # Place your documents here
├── knowledge_graph/                # Generated knowledge graph data
├── cache/                          # Document cache (optimized version)
├── build_knowledge_graph.py        # Standard knowledge graph builder (with incremental support)
├── optimized_build_knowledge_graph.py # Optimized knowledge graph builder (with incremental support)
├── query_graph.py                  # Query the GraphRAG system
├── visualize_graph.py              # Visualize the knowledge graph
├── entity_extractor.py             # Entity extraction utilities
├── graph_builder.py                # Knowledge graph construction
├── graph_retriever.py              # Graph-based retrieval
├── performance_comparison.py       # Performance benchmarking tool
├── install_optimized_deps.py       # Install optimization dependencies
├── config.yaml                     # Standard configuration
├── optimized_config.yaml           # Optimized configuration
├── OPTIMIZATION_GUIDE.md           # Detailed optimization guide
└── requirements.txt                # Python dependencies

🔧 Configuration

The system uses YAML configuration files to manage all settings. There are two configuration files:

  • config.yaml - Standard configuration for regular usage
  • optimized_config.yaml - Optimized configuration for better performance

Configuration Files

Standard Configuration (config.yaml):

llm:
  provider: "ollama"
  model: "llama3.1:latest"
  temperature: 0.1
  max_tokens: 2048

embedding:
  provider: "ollama"
  model: "nomic-embed-text"

graph:
  storage: "networkx"
  max_entities: 1000
  min_relationship_confidence: 0.7

entity_extraction:
  use_spacy: true
  use_llm: true
  min_entity_confidence: 0.7

Optimized Configuration (optimized_config.yaml):

llm:
  provider: "ollama"
  model: "llama3.1:latest"
  temperature: 0.1
  max_tokens: 2048

performance:
  enable_progress_bars: true
  save_intermediate_results: true

parallel:
  max_workers: null  # auto-detect
  entity_extraction_parallel: true

cache:
  enabled: true
  cache_dir: "cache"
  max_cache_size: "1GB"

Configuration Management

How the system loads configuration:

  1. Primary source: YAML configuration files (config.yaml or optimized_config.yaml)
  2. Fallback defaults: Built-in defaults if config files are missing
  3. Command-line override: Arguments can override any config value

Configuration precedence (highest to lowest):

  1. Command-line arguments (e.g., --llm-model)
  2. YAML configuration file values
  3. Built-in defaults
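
That precedence amounts to a layered dictionary merge: defaults first, then file values, then any CLI flags actually passed. A minimal sketch (the `file_config` dict stands in for parsed `config.yaml`, and the default values shown are illustrative, not the project's real defaults):

```python
# Illustrative built-in defaults (lowest precedence).
DEFAULTS = {"llm_model": "llama3.1:latest", "chunk_size": 1000, "chunk_overlap": 200}

def resolve_config(file_config, cli_args):
    config = dict(DEFAULTS)            # 3. built-in defaults
    config.update(file_config)         # 2. YAML file values override defaults
    # 1. CLI arguments override everything, but only flags the user passed.
    config.update({k: v for k, v in cli_args.items() if v is not None})
    return config

file_config = {"chunk_size": 800}                            # as if from config.yaml
cli_args = {"llm_model": "llama3.1:7b", "chunk_size": None}  # only --llm-model given
print(resolve_config(file_config, cli_args))
```

The `None` check matters: an argparse flag the user did not supply must not clobber a value set in the YAML file.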

Examples:

# Use standard configuration (loads config.yaml)
python build_knowledge_graph.py

# Use optimized configuration (loads optimized_config.yaml)
python optimized_build_knowledge_graph.py

# Override specific values from command line
python build_knowledge_graph.py --llm-model llama3.1:7b --chunk-size 500

# Override provider and model
python query_graph.py "Your question" --llm-provider openai --llm-model gpt-4

Environment Variables (Optional):

You can also set environment variables to override configuration:

# Set environment variables
export LLM_PROVIDER=ollama
export LLM_MODEL=llama3.1:latest

# Run without specifying model
python build_knowledge_graph.py

📊 Knowledge Graph Components

Entities

  • People: Names, roles, organizations
  • Places: Locations, addresses, regions
  • Concepts: Ideas, topics, themes
  • Events: Actions, occurrences, processes
  • Objects: Products, tools, resources

Relationships

  • Semantic: "is_a", "part_of", "related_to"
  • Temporal: "before", "after", "during"
  • Causal: "causes", "influences", "leads_to"
  • Spatial: "located_in", "near", "contains"
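
A common way to hold this model is as typed (subject, relation, object) triples, with relations grouped by the categories above. The storage and helper functions below are illustrative, not the repository's internal representation:

```python
# Relation names mirror the categories listed above.
SEMANTIC = {"is_a", "part_of", "related_to"}
TEMPORAL = {"before", "after", "during"}
CAUSAL   = {"causes", "influences", "leads_to"}
SPATIAL  = {"located_in", "near", "contains"}

triples = [
    ("Firewall", "part_of", "Security"),
    ("Encryption", "influences", "Security"),
    ("Microsoft", "located_in", "Redmond"),
]

def by_category(triples, category):
    # Filter relationships by category (semantic, temporal, causal, spatial).
    return [(s, r, o) for s, r, o in triples if r in category]

def neighbours(triples, entity):
    # Entities directly linked to the given one, with the linking relation.
    return [(r, o) for s, r, o in triples if s == entity] + \
           [(r, s) for s, r, o in triples if o == entity]

print(by_category(triples, CAUSAL))
print(neighbours(triples, "Security"))
```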

🔍 Query Types

Simple Entity Queries

# Find all entities related to "security"
query_graph("What entities are related to security?")

Relationship Queries

# Find relationships between two entities
query_graph("What is the relationship between X and Y?")

Complex Reasoning

# Multi-hop reasoning through the graph
query_graph("How does X influence Y through Z?")
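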
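
Multi-hop reasoning reduces to path finding over the graph: follow edges from X until Y is reached and report the chain of relationships. A breadth-first sketch over a directed toy graph (not the repository's retriever):

```python
from collections import deque

# Toy directed edge list: X influences Z, which leads to Y.
EDGES = [
    ("X", "influences", "Z"),
    ("Z", "leads_to", "Y"),
    ("X", "related_to", "W"),
]

def find_path(start, goal):
    # Build an adjacency map, then BFS, carrying the relationship chain.
    adj = {}
    for s, rel, o in EDGES:
        adj.setdefault(s, []).append((rel, o))
    queue, seen = deque([(start, [])]), {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for rel, nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(node, rel, nxt)]))
    return None  # no connecting path

print(find_path("X", "Y"))
```

BFS guarantees the shortest hop count, which is usually what "how does X influence Y" should surface first.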

🎯 Use Cases

  • Research: Connect concepts across multiple papers
  • Documentation: Understand relationships in technical docs
  • Knowledge Management: Build organizational knowledge graphs
  • Question Answering: Complex reasoning over multiple sources
  • Content Analysis: Discover hidden connections in text

🔬 Advanced Features

Graph Analytics

  • Centrality Analysis: Find most important entities
  • Community Detection: Group related entities
  • Path Finding: Discover connection paths
  • Graph Metrics: Analyze graph structure
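
Two of these analytics are easy to see on a toy edge list: degree centrality (most-connected entities) and connected components (groups of related entities). NetworkX provides both directly (`nx.degree_centrality`, `nx.connected_components`); they are spelled out in plain Python here so the mechanics are visible:

```python
from collections import defaultdict

edges = [("Security", "Cybersecurity"), ("Security", "Authentication"),
         ("Security", "Firewall"), ("Microsoft", "Bill Gates")]

def degree_centrality(edges):
    # Count how many edges touch each entity.
    deg = defaultdict(int)
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return dict(deg)

def connected_components(edges):
    # Depth-first flood fill over an undirected adjacency map.
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, comps = set(), []
    for node in adj:
        if node in seen:
            continue
        stack, comp = [node], set()
        while stack:
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(adj[n] - comp)
        seen |= comp
        comps.append(comp)
    return comps

cent = degree_centrality(edges)
print(max(cent, key=cent.get))           # most central entity
print(len(connected_components(edges)))  # number of communities
```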

Visualization

  • Interactive Graphs: Explore with NetworkX/Plotly
  • Entity Networks: Focus on specific entity types
  • Relationship Maps: Visualize connection patterns
  • Timeline Views: Show temporal relationships

Optimization Tips

For Standard Version:

# Adjust chunk size for better performance
python build_knowledge_graph.py --chunk-size 500 --chunk-overlap 100

# Override model from command line (overrides config.yaml)
python build_knowledge_graph.py --llm-model llama3.1:7b --llm-provider ollama

For Optimized Version:

# Find optimal worker count for your system
python performance_comparison.py --benchmark-workers

# Use optimized configuration (automatically loads optimized_config.yaml)
python optimized_build_knowledge_graph.py

# Monitor resource usage
htop  # or Activity Monitor on macOS

Memory Optimization:

# Reduce memory usage for large datasets
python optimized_build_knowledge_graph.py --max-workers 2 --chunk-size 500

# Enable memory mapping for very large datasets
# Edit optimized_config.yaml:
# memory:
#   enable_memory_mapping: true

Caching Benefits:

  • First run: Processes all documents (slower)
  • Subsequent runs: Only processes changed documents (much faster)
  • Cache location: cache/ directory (automatically managed)
  • Cache cleanup: Use --reset to clear cache

Incremental Update Benefits:

  • First run: Processes all documents and builds knowledge graph
  • Subsequent runs: Only processes new/modified documents
  • Knowledge graph merging: Adds new entities to existing graph
  • State tracking: Remembers which files have been processed
  • File change detection: Uses MD5 hashes to detect modifications
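
The hash-based detection above can be sketched in a few lines: a file is reprocessed only when its MD5 digest differs from the one recorded in the saved state. The function names are illustrative, not the repository's actual ones:

```python
import hashlib
import pathlib
import tempfile

def md5_of(path):
    return hashlib.md5(pathlib.Path(path).read_bytes()).hexdigest()

def changed_files(paths, state):
    """Return files whose current hash differs from the recorded state
    (state is a dict mapping path string -> previously seen MD5)."""
    return [p for p in paths if state.get(str(p)) != md5_of(p)]

# Demo: first run processes the file, unchanged runs skip it,
# and a modification triggers reprocessing.
with tempfile.TemporaryDirectory() as d:
    doc = pathlib.Path(d) / "doc.txt"
    doc.write_text("original content")
    state = {}                                   # first run: nothing recorded
    assert changed_files([doc], state) == [doc]
    state[str(doc)] = md5_of(doc)                # record after processing
    assert changed_files([doc], state) == []     # unchanged -> skipped
    doc.write_text("modified content")
    assert changed_files([doc], state) == [doc]  # modified -> reprocessed
```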

Example Workflow:

# 1. Initial build (processes all documents)
python build_knowledge_graph.py --reset

# 2. Add new documents to data/ folder
cp new_document.pdf data/

# 3. Update knowledge graph (only processes the new document)
python build_knowledge_graph.py

# 4. Modify an existing document
echo "New content" >> data/existing_document.txt

# 5. Update knowledge graph (only processes the modified document)
python build_knowledge_graph.py

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Microsoft Research for GraphRAG concept
  • LangChain for the framework
  • Ollama for local LLM support
  • NetworkX for graph operations
