A Retrieval-Augmented Generation (RAG) application that allows users to chat with their PDF documents. This project uses LangChain, FAISS for vector storage, and HuggingFace and Groq LLMs to provide accurate answers based on document context.
- Document Ingestion: Loads and processes PDF files from a local directory.
- Text Splitting: Breaks down large documents into manageable chunks using RecursiveCharacterTextSplitter.
- Vector Embeddings: Uses sentence-transformers/all-MiniLM-L6-v2 to create semantic embeddings.
- Vector Store: Stores embeddings locally using FAISS for fast similarity search.
- Multi-Interface:
  - CLI Mode: Test retrieval and generation via the terminal.
  - Web UI: A user-friendly chat interface built with Streamlit.
- LLM Integration: Supports HuggingFace Endpoints (Mistral) and Groq.
├── data/ # Directory to store input PDF files
├── vectorstore/ # Directory where FAISS index is saved
├── memory_llm.py # Script to ingest PDFs and create vector store
├── connect_memory_llm.py # Script to test RAG pipeline via CLI
├── docbot.py # Streamlit application for the Chatbot UI
├── requirements.txt # Python dependencies
└── .env # Environment variables (API Keys)
- Python 3.10+
- LangChain (Framework)
- Streamlit (Frontend)
- FAISS (Vector Database)
- HuggingFace (Embeddings & LLM)
- PDFPlumber (Document Loading)
git clone <repository-url>
cd <repository-folder>
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
Create a requirements.txt file (if not present) with the following content, then install:
langchain
langchain-community
langchain-huggingface
langchain-groq
faiss-cpu
pdfplumber
streamlit
python-dotenv
huggingface_hub
Run command:
pip install -r requirements.txt
Create a .env file in the root directory and add your API keys:
HF_TOKEN=your_huggingface_access_token
GROQ_API_KEY=your_groq_api_key
- HF_TOKEN: Get it from HuggingFace Settings.
- GROQ_API_KEY: Get it from Groq Console.
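Before running the scripts, it can help to confirm that both keys are actually visible to Python. The following is a minimal sketch, not part of the project: the helper name missing_keys is hypothetical, and it assumes the variables have already been loaded into the process environment (for example by python-dotenv, which is listed in the requirements).

```python
import os

# Keys named in the .env example above.
REQUIRED_KEYS = ("HF_TOKEN", "GROQ_API_KEY")


def missing_keys(env=None):
    """Return the required key names that are absent or blank.

    `env` defaults to os.environ; passing a plain dict makes the
    function easy to try out without touching real credentials.
    """
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing keys:", ", ".join(missing))
    else:
        print("All API keys found.")
```

If a key is reported missing, check that the .env file sits in the project root and that the variable names match exactly.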
Place your PDF files into the data/ folder. Then, run the ingestion script to create the vector database.
python memory_llm.py
This will create a vectorstore/db_faiss directory containing your embeddings.
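To give a rough feel for the chunking that happens during ingestion, here is a simplified, standard-library-only sketch. It is a stand-in, not the RecursiveCharacterTextSplitter used by the project: it cuts at fixed character offsets rather than at paragraph or sentence separators. The function name split_text and the overlap of 50 characters are assumptions made for illustration.

```python
def split_text(text, chunk_size=500, chunk_overlap=50):
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share chunk_overlap characters so that a
    sentence cut at a boundary still appears whole in one chunk.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    print([len(c) for c in split_text("x" * 1200)])  # [500, 500, 300]
```

Smaller chunks tend to give more precise retrieval hits but less context per hit, so the chunk size is worth tuning for your documents.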
To test if the retrieval is working correctly without the web UI:
python connect_memory_llm.py
Launch the Streamlit web interface:
streamlit run docbot.py
Open your browser at http://localhost:8501 to start chatting with your PDFs!
- Ingestion (memory_llm.py): The script loads PDFs, splits text into 500-character chunks, converts them into vectors using HuggingFace embeddings, and saves them to a local FAISS index.
- Retrieval: When a user asks a question, the system searches the FAISS index for the top 3 most similar document chunks.
- Generation (docbot.py): The retrieved chunks and the user's question are sent to the LLM (via the Groq API). The LLM generates a concise answer based strictly on the provided context.
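The retrieval and generation steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the project's code: it ranks toy vectors by cosine similarity (the real FAISS index may use a different distance metric), and the names cosine, top_k, and build_prompt as well as the prompt wording are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, indexed_chunks, k=3):
    """Return the texts of the k chunks most similar to the query.

    indexed_chunks is a list of (text, vector) pairs, standing in
    for what the vector store holds after ingestion.
    """
    ranked = sorted(
        indexed_chunks,
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


def build_prompt(question, chunks):
    """Combine retrieved chunks and the question into one prompt."""
    context = "\n\n".join(chunks)
    return (
        "Answer using only the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```

In the actual application, the query vector would come from the same embedding model used at ingestion time, and the resulting prompt would be sent to the LLM.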
- Model Selection: docbot.py is currently configured to use Groq. Ensure your .env file has a valid GROQ_API_KEY.
- Warnings: You may see "dangerous deserialization" warnings from FAISS. This is expected when loading local index files you created yourself; the code passes allow_dangerous_deserialization=True to handle this. Do not load index files from untrusted sources with this flag enabled.