Generative AI RAG Pipeline using LLaMA & FAISS

A retrieval-augmented generation (RAG) system that combines document retrieval with local AI-powered question answering. Ask questions about your documents and get accurate, context-aware answers from LLaMA 3 running locally via Ollama.

Inspiration

The project was inspired by the growing need for efficient, privacy-preserving AI systems that can answer questions about personal or domain-specific documents without relying on external APIs. With the rise of large language models like LLaMA, we wanted to combine retrieval-augmented generation (RAG) with local vector search in a tool that improves factual accuracy, reduces hallucinations, and is accessible to developers and researchers.

What it does

This RAG pipeline processes user-provided documents (text and PDF files), converts them into vector embeddings, and stores them in a FAISS index for fast similarity search. For each query it retrieves the most relevant context and generates a natural language answer using a local LLaMA model via Ollama. An interactive Q&A interface lets users ask questions about their documents, with answers grounded in the retrieved content.
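The embed, index, and retrieve steps described above can be illustrated with a dependency-free sketch. It is not the project's code: a word-count vector stands in for the Sentence Transformers embedding, and a linear cosine-similarity scan stands in for the FAISS index. The names `embed`, `cosine`, and `retrieve` are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a sentence-embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: List[str], k: int = 2) -> List[str]:
    """Return the k chunks most similar to the query (linear scan, stand-in for a FAISS search)."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


chunks = [
    "FAISS stores vectors and answers nearest-neighbour queries.",
    "Ollama runs LLaMA 3 on the local machine.",
    "PDF files are parsed into plain text before embedding.",
]
top = retrieve("Which tool runs LLaMA locally?", chunks, k=1)
print(top[0])
```

A real embedding model captures meaning rather than exact word overlap, which is why the actual pipeline uses Sentence Transformers instead of counts like these.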

How we built it

We built the system using Python with key libraries:

LangChain: For the RAG pipeline orchestration
Sentence Transformers: For text embeddings
FAISS: For vector storage and similarity search
PyPDF: For document parsing
LlamaIndex: For indexing
Ollama: For running LLaMA 3 locally

The architecture consists of modular components (data loader, embeddings generator, vector store, retriever, and generator) integrated into a main script that handles initialization and interactive querying. The vector store persists across sessions, so documents do not need to be re-embedded on every run, and the pipeline uses LangChain's RetrievalQA chain to combine retrieval and generation.
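How a retriever and a generator fit together can be sketched without LangChain. The following is an illustrative assumption, not the repository's code: `build_prompt` and `answer` are hypothetical names, and placing all retrieved chunks into one prompt is similar in spirit to the "stuff" strategy of a RetrievalQA chain. Both components are passed in as plain callables, which is one way to keep the modules swappable.

```python
from typing import Callable, List, Tuple


def build_prompt(question: str, contexts: List[str]) -> str:
    """Concatenate the retrieved chunks and the question into a single prompt."""
    context_block = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(contexts))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}\nAnswer:"
    )


def answer(
    question: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str], str],
) -> Tuple[str, List[str]]:
    """Run retrieval, then generation; return the answer and the chunks it was grounded in."""
    contexts = retrieve(question)
    return generate(build_prompt(question, contexts)), contexts


# Stub components so the sketch runs without FAISS or Ollama.
def fake_retrieve(question: str) -> List[str]:
    return ["The vector store is persisted in faiss_index/."]


def fake_generate(prompt: str) -> str:
    return f"(model output for a {len(prompt)}-character prompt)"


reply, sources = answer("Where is the index stored?", fake_retrieve, fake_generate)
print(reply)
print(sources)
```

Returning the retrieved chunks alongside the answer is what makes source attribution possible in the interactive interface.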

Challenges we ran into

Local LLM Integration: Ensuring model availability and compatibility with Ollama
Vector Database Optimization: Handling large document sets without memory overflow
Document Format Handling: Parsing diverse formats (especially PDFs with encoding issues)
Embedding Quality: Handling short or noisy text effectively
Balance: Achieving coherence between retrieval relevance and generation quality
Debugging: Managing end-to-end pipeline error handling for missing dependencies
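For the memory issue on large document sets, one common mitigation is to embed and index chunks in fixed-size batches rather than all at once. The sketch below shows only a batching helper; `batched` and the batch size are illustrative assumptions, and the embedding step is left as a comment.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batched(items: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most batch_size items without materialising the whole input."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


chunks = (f"chunk {i}" for i in range(7))  # a generator, so nothing is held in memory up front
for batch in batched(chunks, batch_size=3):
    # In a real pipeline, each batch would be embedded and added to the index here.
    print(len(batch), batch[0])
```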

Accomplishments we're proud of

Fully functional, end-to-end RAG system running entirely locally
No external APIs required: documents and queries stay on the local machine
Modular, extensible architecture for easy improvements
Efficient FAISS-based vector search for scalability
Practical integration of LLaMA for high-quality text generation
Interactive user experience with source document attribution

What we learned

Architecture: Importance of modular design for maintainability and scalability
Vector Databases: Trade-offs between speed and accuracy in FAISS
Local LLMs: Value of self-hosted solutions for data privacy
Optimization: Balancing computational resources with performance
Prompt Engineering: Techniques to improve answer quality through context and phrasing
Retrieval Strategies: Advanced methods for better context relevance
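The speed versus accuracy trade-off in vector search is usually quantified as recall: the share of the true nearest neighbours (from an exhaustive search) that a faster approximate index also returns. The sketch below is a generic illustration of that metric, not code from this project; `recall_at_k` and the ID lists are hypothetical.

```python
from typing import Sequence


def recall_at_k(exact: Sequence[int], approximate: Sequence[int], k: int) -> float:
    """Fraction of the true top-k neighbours that the approximate search also returned in its top k."""
    if k < 1:
        raise ValueError("k must be at least 1")
    truth = set(exact[:k])
    if not truth:
        return 0.0
    return len(truth & set(approximate[:k])) / len(truth)


exact_ids = [12, 7, 33, 4, 19]    # hypothetical IDs from an exhaustive (exact) search
approx_ids = [12, 33, 8, 7, 21]   # hypothetical IDs from a faster approximate index
print(recall_at_k(exact_ids, approx_ids, k=5))
```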

What's next

Planned improvements:

Support for more document types (images via OCR)
Advanced chunking strategies for better retrieval
Multi-modal inputs and processing
Domain-specific LLaMA fine-tuning
Web UI for broader accessibility
Hybrid retrieval (keyword + semantic search)
Cloud deployment options
Evaluation metrics and user feedback loops
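For the hybrid retrieval item, one widely used way to merge a keyword ranking with a semantic ranking is reciprocal rank fusion (RRF). This is a sketch of that technique under assumed inputs, not something the project implements yet; the document IDs are made up, and `k = 60` is a commonly used default rather than a tuned value.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists of document IDs; each list adds 1 / (k + rank) to a document's score."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score, highest first; break ties by ID so the result is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


keyword_hits = ["doc_a", "doc_b", "doc_c"]   # e.g. from a keyword search
semantic_hits = ["doc_b", "doc_d", "doc_a"]  # e.g. from a vector search
print(reciprocal_rank_fusion([keyword_hits, semantic_hits]))
```

Documents that appear in both lists rise to the top, while RRF avoids having to put keyword scores and vector distances on a common scale.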

Built with

Languages & Frameworks:

Python
LangChain
Sentence Transformers
FAISS (Vector Database)
PyPDF
LlamaIndex
Ollama API

Platforms: Local execution (Windows/Linux/Mac)

Cloud Services: None (fully local)

Installation & Usage

Prerequisites

Python 3.8+
Ollama installed and running
LLaMA 3 model downloaded: ollama pull llama3
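To confirm these prerequisites from Python, one option is to ask the Ollama server which models it has installed via its `/api/tags` endpoint. This is a minimal sketch, not part of the repository: the helper names are hypothetical, and the URL assumes Ollama's default address of `http://localhost:11434`.

```python
import json
import urllib.request
from typing import List

OLLAMA_URL = "http://localhost:11434"  # Ollama's default address; adjust if yours differs


def model_names(tags_payload: dict) -> List[str]:
    """Extract model names from the JSON returned by Ollama's /api/tags endpoint."""
    return [m["name"] for m in tags_payload.get("models", [])]


def has_model(tags_payload: dict, model: str = "llama3") -> bool:
    """True if the model is installed, with or without an explicit tag such as ':latest'."""
    return any(
        name == model or name.startswith(model + ":")
        for name in model_names(tags_payload)
    )


def fetch_tags(base_url: str = OLLAMA_URL, timeout: float = 5.0) -> dict:
    """Query the local Ollama server; raises URLError if it is not running."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)


# Offline example using a payload shaped like the server's response.
sample = {"models": [{"name": "llama3:latest"}, {"name": "mistral:7b"}]}
print(has_model(sample))
```

With Ollama running, `has_model(fetch_tags())` performs the live check.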

Setup

Clone the repository
```
git clone https://github.com/indrahacks/RAG_Hackathon.git
cd RAG_Hackathon
```
Create virtual environment
```
python -m venv .venv
.venv\Scripts\activate  # On Windows
source .venv/bin/activate  # On macOS/Linux
```

Install dependencies
```
pip install -r requirements.txt
```
Ensure Ollama is running
```
ollama serve
```
Add sample documents (place .txt or .pdf files in data/sample_documents/)
Run the pipeline
```
python main.py
```
Ask questions - Follow the interactive prompts to query your documents

Project Structure

RAG_Hackathon/
├── main.py                    # Entry point
├── requirements.txt           # Python dependencies
├── config/
│   ├── __init__.py
│   └── settings.py           # Configuration settings
├── data/
│   ├── processed/            # (Reserved for processed data)
│   └── sample_documents/     # Place your .txt/.pdf files here
├── faiss_index/
│   └── index.faiss           # Persisted vector index
└── src/
    ├── __init__.py
    ├── data_loader.py        # Document loading logic
    ├── embeddings.py         # Embedding generation
    ├── vector_store.py       # FAISS integration
    ├── retriever.py          # Context retrieval
    ├── generator.py          # LLaMA integration
    └── rag_pipeline.py       # RAG chain orchestration

How to Present to a Judge

Setup Environment: Install dependencies, start Ollama, pull LLaMA 3
Prepare Data: Add sample documents to data/sample_documents/
Run the System: Execute python main.py to initialize the pipeline
Demonstrate Features:
- Show step-by-step initialization output
- Ask sample questions about documents
- Highlight retrieval accuracy and generation quality
- Display source documents used for answers
Key Points to Emphasize:
- Scalability: Handles multiple documents efficiently
- Privacy: No cloud services or external APIs
- Accuracy: Retrieval-augmented approach reduces hallucinations
- Local Execution: Runs entirely on-device for data security

Generative AI Implementation

LLaMA 3 is the core generative model. It handles the hardest task, converting retrieved context into natural language answers, and is used in:

src/generator.py: Model initialization and connection
src/rag_pipeline.py: Integration into RetrievalQA chain
main.py: Interactive question answering loop

LLaMA 3 was chosen for its coherent, high-quality responses, and running it locally keeps the model fully under our control.
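For reference, the generation step can also be reproduced without LangChain by calling Ollama's HTTP API directly. The sketch below is an assumption-laden illustration rather than the contents of src/generator.py: the function names are hypothetical, and it targets the `/api/generate` endpoint at Ollama's default address with streaming disabled.

```python
import json
import urllib.request


def build_generate_request(
    prompt: str,
    model: str = "llama3",
    base_url: str = "http://localhost:11434",
) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt to the local model and return its text (requires a running Ollama server)."""
    with urllib.request.urlopen(build_generate_request(prompt), timeout=timeout) as resp:
        return json.load(resp)["response"]


# Building the request needs no server, so this part runs anywhere.
request = build_generate_request("Question: What does FAISS store?\nAnswer:")
print(request.get_method(), request.full_url)
```

Calling `generate(...)` sends the request; it only succeeds once `ollama serve` is running and the `llama3` model has been pulled.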

License

MIT License

Built at the RAG Hackathon
