A retrieval-augmented generation (RAG) system that combines document retrieval with local AI-powered question answering. Ask questions about your documents and get accurate, context-aware answers from LLaMA 3 running locally via Ollama.
The project grew out of the need for efficient, privacy-preserving AI systems that answer questions from personal or domain-specific documents without relying on external APIs. We combined retrieval-augmented generation with local vector search so that answers are grounded in the user's own documents, which improves factual accuracy and reduces hallucinations, in a tool that developers and researchers can run themselves.
This RAG pipeline processes user-provided documents (text and PDF files), converts them into vector embeddings, stores them in a FAISS index for fast similarity search, retrieves the passages most relevant to each query, and generates natural language answers using a local LLaMA 3 model via Ollama. An interactive Q&A interface lets users ask questions about their documents, and each answer is grounded in the retrieved content.
We built the system using Python with key libraries:
- LangChain: For the RAG pipeline orchestration
- Sentence Transformers: For text embeddings
- FAISS: For vector storage and similarity search
- PyPDF: For document parsing
- LlamaIndex: For indexing
- Ollama: For running LLaMA 3 locally
The architecture includes modular components—data loader, embeddings generator, vector store, retriever, and generator—integrated into a main script that handles initialization and interactive querying. The vector store persists across sessions for efficiency, and the pipeline uses LangChain's RetrievalQA chain to combine retrieval and generation.
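The retrieve-then-generate flow described above can be illustrated with a minimal, dependency-free sketch. It is not the project's code: the word-count `embed` function stands in for Sentence Transformers embeddings, the linear scan in `retrieve` stands in for a FAISS index, and `build_prompt` only assembles the text that would be sent to the model. All function names here are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (assumed stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query (linear scan, not FAISS)."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(query, context_chunks):
    """Combine retrieved context and the question into a single prompt string."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )


chunks = [
    "FAISS stores vectors and supports fast similarity search.",
    "Ollama runs LLaMA 3 on the local machine.",
    "PyPDF extracts text from PDF files.",
]
question = "Which tool runs LLaMA 3 locally?"
top = retrieve(question, chunks, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

In the real pipeline the same two stages are handled by the retriever and generator modules, so swapping the toy pieces for real embeddings and a real index should not change the overall shape.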
- Local LLM Integration: Ensuring model availability and compatibility with Ollama
- Vector Database Optimization: Handling large document sets without memory overflow
- Document Format Handling: Parsing diverse formats (especially PDFs with encoding issues)
- Embedding Quality: Handling short or noisy text effectively
- Balance: Achieving coherence between retrieval relevance and generation quality
- Debugging: Managing end-to-end pipeline error handling for missing dependencies
- Fully functional, end-to-end RAG system running entirely locally
- No external APIs required: complete data privacy
- Modular, extensible architecture for easy improvements
- Efficient FAISS-based vector search for scalability
- Practical integration of LLaMA for high-quality text generation
- Interactive user experience with source document attribution
- Architecture: Importance of modular design for maintainability and scalability
- Vector Databases: Trade-offs between speed and accuracy in FAISS
- Local LLMs: Value of self-hosted solutions for data privacy
- Optimization: Balancing computational resources with performance
- Prompt Engineering: Techniques to improve answer quality through context and phrasing
- Retrieval Strategies: Advanced methods for better context relevance
Planned improvements:
- Support for more document types (images via OCR)
- Advanced chunking strategies for better retrieval
- Multi-modal inputs and processing
- Domain-specific LLaMA fine-tuning
- Web UI for broader accessibility
- Hybrid retrieval (keyword + semantic search)
- Cloud deployment options
- Evaluation metrics and user feedback loops
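As one possible shape for the hybrid retrieval item above, the sketch below merges a keyword ranking and a semantic ranking with reciprocal rank fusion (RRF). RRF is only one common way to combine rankings and is an assumption here, not a method the project has committed to. The document IDs and the constant `k=60` are illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs (each ordered best first).

    Each document scores 1 / (k + rank) per list it appears in; scores are summed.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


keyword_hits = ["doc_b", "doc_a", "doc_d"]   # hypothetical keyword-search results
semantic_hits = ["doc_a", "doc_c", "doc_b"]  # hypothetical vector-search results
fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
print(fused)
```

Documents that rank well in both lists rise to the top, while documents found by only one method still appear further down.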
Languages & Frameworks:
- Python
- LangChain
- Sentence Transformers
- FAISS (Vector Database)
- PyPDF
- LlamaIndex
- Ollama API
Platforms: Local execution (Windows/Linux/Mac)
Cloud Services: None (fully local)
- Python 3.8+
- Ollama installed and running
- LLaMA 3 model downloaded:
ollama pull llama3
1. Clone the repository
    git clone https://github.com/indrahacks/RAG_Hackathon.git
    cd RAG_Hackathon
2. Create virtual environment
    python -m venv .venv
    .venv\Scripts\activate # On Windows
    source .venv/bin/activate # On macOS/Linux
3. Install dependencies
    pip install -r requirements.txt
4. Ensure Ollama is running
    ollama serve
5. Add sample documents (place .txt or .pdf files in data/sample_documents/)
6. Run the pipeline
    python main.py
7. Ask questions: follow the interactive prompts to query your documents
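Before running the pipeline, it can help to confirm that the documents folder contains files the loader can read. The following is a small sketch of such a check, assuming the loader accepts only .txt and .pdf files as described in the steps above. The `find_documents` helper is hypothetical and is not part of the repository.

```python
from pathlib import Path

# Assumption: only these two formats are supported, per the setup steps.
SUPPORTED_SUFFIXES = {".txt", ".pdf"}


def find_documents(folder):
    """Return supported document paths in `folder`, sorted; empty list if the folder is missing."""
    root = Path(folder)
    if not root.is_dir():
        return []
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED_SUFFIXES
    )


if __name__ == "__main__":
    docs = find_documents("data/sample_documents")
    if docs:
        for path in docs:
            print(f"found: {path.name}")
    else:
        print("No .txt or .pdf files found in data/sample_documents/")
```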
rag_project/
├── main.py                  # Entry point
├── requirements.txt         # Python dependencies
├── config/
│   ├── __init__.py
│   └── settings.py          # Configuration settings
├── data/
│   ├── processed/           # (Reserved for processed data)
│   └── sample_documents/    # Place your .txt/.pdf files here
├── faiss_index/
│   └── index.faiss          # Persisted vector index
└── src/
    ├── __init__.py
    ├── data_loader.py       # Document loading logic
    ├── embeddings.py        # Embedding generation
    ├── vector_store.py      # FAISS integration
    ├── retriever.py         # Context retrieval
    ├── generator.py         # LLaMA integration
    └── rag_pipeline.py      # RAG chain orchestration
- Setup Environment: Install dependencies, start Ollama, pull LLaMA 3
- Prepare Data: Add sample documents to data/sample_documents/
- Run the System: Execute python main.py to initialize the pipeline
- Demonstrate Features:
  - Show step-by-step initialization output
  - Ask sample questions about documents
  - Highlight retrieval accuracy and generation quality
  - Display source documents used for answers
- Key Points to Emphasize:
  - Scalability: Handles multiple documents efficiently
  - Privacy: No cloud services or external APIs
  - Accuracy: Retrieval-augmented approach reduces hallucinations
  - Local Execution: Runs entirely on-device for data security
LLaMA 3 is the core generative AI model used for the hardest task—converting retrieved context into natural language answers. It's used in:
- src/generator.py: Model initialization and connection
- src/rag_pipeline.py: Integration into RetrievalQA chain
- main.py: Interactive question answering loop
The choice of LLaMA 3 ensures high-quality, coherent responses while maintaining full control over the model through local execution.
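To show what talking to the local model involves at the lowest level, here is a sketch that calls Ollama's HTTP generate endpoint directly with the standard library. The contents of src/generator.py are not shown in this README, and the project most likely goes through LangChain's Ollama integration instead, so treat this as an assumed equivalent. It uses Ollama's default address (localhost, port 11434), and the `build_request` and `generate` helpers are hypothetical names.

```python
import json
import urllib.request

# Ollama's default local endpoint; adjust if the server runs elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt, model="llama3"):
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model="llama3", timeout=120):
    """Send the prompt and return the generated text (requires `ollama serve` to be running)."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]
```

With the server running and the model pulled, `generate("Say hello in one sentence.")` would return the model's reply as a string. Because nothing leaves localhost, the privacy claims of the project hold at this layer too.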
MIT License
Built at the RAG Hackathon