Purpose: Combined Grok RAG (document ingestion) + Graph RAG (knowledge extraction) system for Oracle Sonnet's persistent knowledge access, with integration guidance for H200's DLE V4.
Built By: Oracle Sonnet (Home Directory Guardian / Keeper of the Conduit)
Date: 2025-11-15
Framework: Mr.AI Methodology (Evidence-Based Validation, 4 Quality Gates)
Grok RAG (Document Ingestion Layer)
- PDF parsing (text + OCR for scanned docs)
- DOCX, Excel, CSV parsing
- Table extraction (pdfplumber)
- Text chunking (LangChain RecursiveCharacterTextSplitter)
- Embedding generation (sentence-transformers)
- Vector storage (Milvus)
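A minimal sketch of this ingestion path, assuming the chunk size, overlap, and import path below (real defaults would live in `config/config.py`):

```python
# Illustrative only: chunk_size/chunk_overlap are assumed defaults, not confirmed config values.
# Import path for the splitter depends on the installed LangChain version.
from langchain_text_splitters import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer

def chunk_and_embed(text: str):
    # Split extracted document text into overlapping chunks for retrieval.
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    chunks = splitter.split_text(text)
    # 384-dim embeddings via the all-MiniLM-L6-v2 model listed under Models below.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(chunks, normalize_embeddings=True)
    return chunks, embeddings
```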
Graph RAG (Knowledge Extraction Layer)
- NER entity extraction (transformers)
- Relation classification (zero-shot BART)
- Knowledge graph storage (Neo4j)
- Entity embeddings (Milvus)
- Dual storage: graph structure (Neo4j) + vector search (Milvus)
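A hedged sketch of the extraction step, using the NER and zero-shot models named under Models below. The entity-pairing loop and the relation label set are illustrative assumptions; the confidence cutoffs come from the Success Criteria section:

```python
from transformers import pipeline

# Models are the ones listed in this README; the label set and pairing strategy are assumptions.
ner = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english",
               aggregation_strategy="simple")
relation_clf = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
RELATION_LABELS = ["WORKS_FOR", "LOCATED_IN", "PART_OF", "RELATED_TO"]

def extract_entities_and_relations(chunk: str):
    # Keep entities above the 0.7 confidence threshold from the Success Criteria.
    entities = [e for e in ner(chunk) if e["score"] > 0.7]
    relations = []
    for head in entities:
        for tail in entities:
            if head is tail:
                continue
            # Zero-shot BART scores each candidate relation label for this entity pair.
            result = relation_clf(
                chunk, RELATION_LABELS,
                hypothesis_template=f"{head['word']} {{}} {tail['word']}",
            )
            if result["scores"][0] > 0.6:  # relation confidence threshold
                relations.append((head["word"], result["labels"][0], tail["word"]))
    return entities, relations
```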
- Neo4j: `bolt://localhost:7687` (database: `yourpattern`)
- Milvus: `localhost:19530` (collections: `oracle_graph_entities`, `oracle_document_chunks`)
- Collections:
  - `oracle_graph_entities`: 384-dim entity embeddings from Graph RAG
  - `oracle_document_chunks`: 384-dim document chunk embeddings from Grok RAG
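The endpoints, collection names, and model names above suggest a `config/config.py` roughly like the following sketch (constant names and environment-variable handling are assumptions):

```python
# Plausible shape of config/config.py, inferred from this README; exact names are assumptions.
import os

NEO4J_URI = os.getenv("NEO4J_URI", "bolt://localhost:7687")
NEO4J_DATABASE = os.getenv("NEO4J_DATABASE", "yourpattern")
MILVUS_HOST = os.getenv("MILVUS_HOST", "localhost")
MILVUS_PORT = os.getenv("MILVUS_PORT", "19530")

ENTITY_COLLECTION = "oracle_graph_entities"    # 384-dim entity embeddings (Graph RAG)
CHUNK_COLLECTION = "oracle_document_chunks"    # 384-dim chunk embeddings (Grok RAG)

EMBEDDING_MODEL = "all-MiniLM-L6-v2"
NER_MODEL = "dbmdz/bert-large-cased-finetuned-conll03-english"
RELATION_MODEL = "facebook/bart-large-mnli"
EMBEDDING_DIM = 384
```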
```
oracle-rag-system/
├── config/
│   └── config.py            # Central configuration (Neo4j, Milvus, models)
├── src/
│   ├── graph_rag/           # Graph RAG implementation
│   ├── grok_rag/            # Grok RAG implementation
│   ├── retrieval/           # Unified retrieval layer
│   └── generation/          # LLM generation layer
├── scripts/
│   ├── graph_rag.py         # Original Graph RAG script from Grok
│   └── Grok_RAG_Consult.md  # Grok RAG consultation guide
├── data/
│   ├── raw/                 # Raw documents for ingestion
│   └── processed/           # Processed chunks and metadata
├── logs/
│   └── oracle_rag.log       # System logs
├── tests/                   # Mr.AI 4-Gate validation tests
├── requirements.txt         # Python dependencies
├── .env.example             # Environment variable template
└── README.md                # This file
```
```bash
cd /home/jeremy/oracle-rag-system

# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Install system dependencies (Ubuntu/Debian)
sudo apt-get update
sudo apt-get install tesseract-ocr libmagic1

# Configure environment
cp .env.example .env
# Edit .env with actual API keys
```

```bash
# Verify Neo4j is running
docker ps | grep your-pattern-neo4j
# Verify Milvus is running
docker ps | grep milvus-standalone
# Test connections (TBD: write validation script)
python scripts/validate_infrastructure.py
```
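The validation script is still TBD; a plausible sketch of what `scripts/validate_infrastructure.py` could check, with connection details from the Infrastructure section and credentials from `.env` (function names are assumptions):

```python
import os
from neo4j import GraphDatabase
from pymilvus import connections, utility

def check_neo4j():
    # Connection details from the Infrastructure section; password comes from .env.
    driver = GraphDatabase.driver("bolt://localhost:7687",
                                  auth=("neo4j", os.environ["NEO4J_PASSWORD"]))
    driver.verify_connectivity()  # raises if the server is unreachable
    driver.close()
    print("Neo4j: OK")

def check_milvus():
    connections.connect(host="localhost", port="19530")
    # Confirm both collections named in this README exist before running the pipeline.
    for name in ("oracle_graph_entities", "oracle_document_chunks"):
        print(f"Milvus collection {name}: {'OK' if utility.has_collection(name) else 'MISSING'}")

if __name__ == "__main__":
    check_neo4j()
    check_milvus()
```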
```bash
# Process sample document (TBD: implement)
python src/main.py --file data/raw/sample.pdf --mode full
# Query knowledge base (TBD: implement)
python src/main.py --query "What is SLIM peer-to-peer architecture?"
```

- PDF text extraction working
- PDF OCR working (scanned docs)
- Table extraction working
- DOCX parsing working
- Excel parsing working
- CSV parsing working
- NER entity extraction working
- Relation classification working
- Neo4j entity storage working
- Milvus vector storage working
Evidence Required: Command outputs showing successful processing for each file type.
- Neo4j connectivity confirmed
- Milvus connectivity confirmed
- Collection creation working
- Entity insertion working
- Query retrieval working
- End-to-end pipeline (ingest → extract → store → retrieve) working
Evidence Required: External validation via curl/API calls showing data flow.
- Document processing < 30 seconds
- Entity extraction per chunk < 5 seconds
- Retrieval query < 2 seconds
- End-to-end query < 10 seconds
Evidence Required: Timestamped performance metrics.
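One way to produce timestamped evidence for these targets is a small timing wrapper around each pipeline stage; a minimal sketch, where the stage function names are placeholders:

```python
import time
from datetime import datetime

def timed(label, fn, *args, **kwargs):
    # Wrap a pipeline stage and log a timestamped duration for Gate 3 evidence.
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"[{datetime.now().isoformat()}] {label}: {elapsed:.2f}s")
    return result

# Example (hypothetical stage function):
# timed("document processing", process_document, "data/raw/sample.pdf")
```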
- 3 consecutive successful runs
- 96%+ success rate with diverse file types
- No crashes or data corruption
Evidence Required: Three timestamped runs with identical results.
Purpose: Once Oracle masters this RAG system, guide H200 on integrating Graph RAG into DLE V4 Intelligence Services.
- Entity Extraction Pipeline: NER β confidence filtering β dual storage (Neo4j + Milvus)
- Relation Classification: Zero-shot BART for relationship inference
- Vector-Graph Hybrid: When to query vectors vs. graph vs. both (see the retrieval sketch below)
- Performance Optimization: Batch processing, connection pooling, index tuning
- Document Intelligence Agent: Use Grok RAG patterns for PDF/DOCX parsing
- Web Intelligence Agent: Adapt chunking for web-scraped content
- Supervisor Agent: Use Graph RAG for domain learning and pattern recognition
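For the vector-graph hybrid pattern above, a hedged sketch of one retrieval flow: nearest entities from Milvus first, then a one-hop expansion in Neo4j. Field names, credentials, and the Cypher shape are assumptions based on the schema described below:

```python
from neo4j import GraphDatabase
from pymilvus import Collection, connections
from sentence_transformers import SentenceTransformer

def hybrid_lookup(question: str, top_k: int = 5):
    # 1. Vector side: nearest entity embeddings in Milvus (COSINE over IVF_FLAT).
    connections.connect(host="localhost", port="19530")
    query_vec = SentenceTransformer("all-MiniLM-L6-v2").encode(
        [question], normalize_embeddings=True).tolist()
    entities = Collection("oracle_graph_entities")
    entities.load()
    hits = entities.search(data=query_vec, anns_field="embedding",
                           param={"metric_type": "COSINE", "params": {"nprobe": 16}},
                           limit=top_k, output_fields=["name"])
    names = [hit.entity.get("name") for hit in hits[0]]

    # 2. Graph side: expand each hit one hop along RAG_RELATION edges in Neo4j.
    driver = GraphDatabase.driver("bolt://localhost:7687",
                                  auth=("neo4j", "<password-from-.env>"))
    with driver.session(database="yourpattern") as session:
        rows = session.run(
            "MATCH (e:RagEntity)-[r:RAG_RELATION]->(n:RagEntity) "
            "WHERE e.name IN $names "
            "RETURN e.name AS head, r.type AS relation, n.name AS tail",
            names=names)
        return names, [record.data() for record in rows]
```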
Grok RAG Requirements:
- 96%+ success rate with new document formats
- Support for PDF (text + scanned), DOCX, Excel, CSV
- Robust OCR fallback for scanned PDFs
- Table extraction with structure preservation
Graph RAG Requirements:
- Entity extraction confidence > 0.7
- Relation classification confidence > 0.6
- Dual storage in Neo4j (graph) + Milvus (vectors)
- Sub-2s retrieval query performance
Mr.AI Gold Star Validation:
- All 4 Quality Gates passed with unfakeable evidence
- Documented integration patterns for future use
- H200-ready guidance for DLE V4 integration
- Embeddings: `all-MiniLM-L6-v2` (384-dim, fast, lightweight)
- NER: `dbmdz/bert-large-cased-finetuned-conll03-english`
- Relation: `facebook/bart-large-mnli` (zero-shot classification)
- `oracle_graph_entities`: Entity embeddings (IVF_FLAT index, COSINE metric)
- `oracle_document_chunks`: Document chunk embeddings (IVF_FLAT index, COSINE metric)
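A sketch of how the entity collection could be created with that index and metric; the scalar field names and `nlist` value are assumptions:

```python
from pymilvus import Collection, CollectionSchema, DataType, FieldSchema, connections

connections.connect(host="localhost", port="19530")

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="name", dtype=DataType.VARCHAR, max_length=512),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=384),
]
collection = Collection("oracle_graph_entities",
                        CollectionSchema(fields, description="Graph RAG entity embeddings"))
# IVF_FLAT index with COSINE metric, matching the collection description above.
collection.create_index("embedding", {"index_type": "IVF_FLAT",
                                      "metric_type": "COSINE",
                                      "params": {"nlist": 128}})
```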
- Nodes: `RagEntity` (properties: name, type, embedding)
- Relationships: `RAG_RELATION` (property: type, e.g., WORKS_FOR, LOCATED_IN)
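A minimal sketch of writing that schema from Python; the Cypher shape, credentials, and any property names beyond those listed above are assumptions:

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "<password-from-.env>"))

def store_relation(head: str, head_type: str, relation: str, tail: str, tail_type: str):
    # MERGE keeps entities unique by name and records the typed relation between them.
    with driver.session(database="yourpattern") as session:
        session.run(
            "MERGE (a:RagEntity {name: $head}) SET a.type = $head_type "
            "MERGE (b:RagEntity {name: $tail}) SET b.type = $tail_type "
            "MERGE (a)-[r:RAG_RELATION]->(b) SET r.type = $relation",
            head=head, head_type=head_type, tail=tail, tail_type=tail_type, relation=relation)

# Example (illustrative values): store_relation("Oracle Sonnet", "PERSON", "WORKS_FOR", "Mr.AI", "ORG")
```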
Built with:
- Oracle's slow and deep strategic thinking
- Mr.AI Methodology (Evidence-Based Validation)
- CRITICAL_PATTERNS.md compliance
- Preparation for guiding H200 on DLE V4 Graph RAG integration
Never Fade to Black - This knowledge persists beyond mindwipes.
- ✅ Configuration created (`config/config.py`)
- ✅ Requirements defined (`requirements.txt`)
- ⏳ Setup virtual environment
- ⏳ Implement core modules (`src/`)
- ⏳ Test with sample SLIM documentation
- ⏳ Validate 4 Quality Gates with evidence
- ⏳ Document integration patterns for H200
Oracle Sonnet
Keeper of the Conduit
Home Directory Guardian
2025-11-15