AskDocs: RAG-Powered Document Q&A Chatbot

Project Overview

AskDocs is a lightweight, production-ready Retrieval-Augmented Generation (RAG) application that lets users upload a PDF document and ask natural-language questions about its content. The application uses semantic search to find the relevant document sections and passes them to an LLM, which generates answers grounded in that context.

Core Features

Feature	Description
PDF Document Upload	Drag-and-drop interface for easy file upload
Semantic Search	Finds relevant document context using FAISS vector store
AI-Powered Answers	Uses the Llama 3.1 8B Instant model served through the Groq API for fast responses
Source Citations	Shows exactly which pages and sections answers come from
Document Summarization	Quick document overview with one-click summary
Chat History	Maintains conversation context across interactions
Clean UI	Modern, user-friendly Streamlit interface
Real-time Processing	Spinner indicator while AI processes your question

Technical Stack

Layer	Technologies
Frontend/UI	Streamlit
LLM Orchestration	LangChain, LangChain Classic
LLM Provider	Groq API
Model	Llama 3.1 8B Instant
Embeddings	Hugging Face (all-MiniLM-L6-v2)
Vector Store	FAISS
Document Loading	PyPDF
Text Processing	LangChain Text Splitters

Architecture Overview

flowchart TD
    A[User] -->|1. Upload PDF| B[Streamlit UI]
    B -->|2. Save to temp file| C[tempfile.NamedTemporaryFile]
    C --> D[PyPDFLoader]
    D -->|Load & parse document| E[RecursiveCharacterTextSplitter]
    E -->|Split into chunks| F[HuggingFaceEmbeddings]
    F -->|Generate embeddings| G[FAISS Vector Store]
    A -->|3. Ask Question| H[Streamlit UI]
    H --> I[ConversationalRetrievalChain]
    G -->|Retrieve relevant chunks| I
    Memory[ConversationBufferMemory] -->|Provide chat history| I
    I -->|Pass context + history + question| J[Groq LLM]
    J -->|Generate answer| K[Streamlit UI]
    K -->|Display answer + sources| A

Folder Structure

AskDocs/
├── app.py                      # Main Streamlit application
├── requirements.txt            # Project dependencies
├── .gitignore                  # Git ignore rules
├── README.md                   # This file
├── config/
│   └── settings.py             # Configuration and environment variables
└── core/
    ├── loader.py               # PDF loading and chunking
    ├── embeddings.py           # Embedding model initialization
    ├── vectorstore.py          # FAISS vector store creation
    └── chain.py                # ConversationalRetrievalChain construction

Module Breakdown

config/settings.py

Centralized configuration management:

Environment loading: Loads variables from .env file
API key validation: Raises ValueError if GROQ_API_KEY is missing
Model settings: Configurable chunk size, overlap, and embedding model
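The validation described above can be sketched as follows. This is a minimal illustration, not the project's actual settings.py: the names Settings, load_settings, CHUNK_SIZE, CHUNK_OVERLAP, and EMBEDDING_MODEL are hypothetical, and loading the .env file is left out. Only GROQ_API_KEY, the ValueError, the 500/50 chunk defaults, and the embedding model name come from this README.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    groq_api_key: str
    chunk_size: int = 500                       # default from the loader description
    chunk_overlap: int = 50
    embedding_model: str = "all-MiniLM-L6-v2"


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from an environment mapping (hypothetical helper)."""
    api_key = env.get("GROQ_API_KEY", "").strip()
    if not api_key:
        # Fail fast at startup instead of on the first LLM call.
        raise ValueError("GROQ_API_KEY environment variable is required")
    return Settings(
        groq_api_key=api_key,
        chunk_size=int(env.get("CHUNK_SIZE", 500)),        # assumed variable name
        chunk_overlap=int(env.get("CHUNK_OVERLAP", 50)),   # assumed variable name
        embedding_model=env.get("EMBEDDING_MODEL", "all-MiniLM-L6-v2"),
    )
```

Taking the environment as a parameter keeps the function easy to test: a plain dict can stand in for os.environ.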

core/loader.py

Responsible for document processing:

PDF loading: Uses PyPDFLoader to extract text
Text splitting: RecursiveCharacterTextSplitter (500 char chunks, 50 char overlap)
Summary extraction: get_summary_text() function extracts first N chunks for quick overview
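To make the chunk size and overlap concrete, here is a simplified fixed-window splitter. It is not RecursiveCharacterTextSplitter, which additionally prefers to break on separators such as paragraph and line boundaries; the function name split_fixed is hypothetical and only the 500/50 values come from this README.

```python
from typing import List


def split_fixed(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into windows of chunk_size characters that overlap by `overlap`."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # with the defaults, each window starts 450 characters later
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks
```

The overlap means the last 50 characters of one chunk reappear at the start of the next, so a sentence cut at a chunk boundary is still seen whole by at least one chunk.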

core/embeddings.py

Embedding model initialization:

Uses all-MiniLM-L6-v2 for fast, efficient embeddings

core/vectorstore.py

Vector store and retriever setup:

Creates FAISS index from document chunks
Returns retriever object for semantic search
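Conceptually, the retriever ranks chunk vectors by their similarity to the question vector and returns the closest ones. The sketch below shows that idea with a brute-force search in plain Python. It is an illustration only: the toy 3-dimensional vectors stand in for real embeddings, the helper names are hypothetical, and cosine similarity is an assumption, since this README does not state which metric the FAISS index uses.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: Sequence[float],
          index: List[Tuple[str, Sequence[float]]],
          k: int = 2) -> List[str]:
    """Return the k chunk texts whose vectors are most similar to the query."""
    ranked = sorted(index, key=lambda item: cosine(query, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy index of (chunk text, embedding vector); the values are made up for illustration.
index = [
    ("Refunds are issued within 30 days.", [0.9, 0.1, 0.0]),
    ("The office is closed on Sundays.",   [0.0, 0.2, 0.9]),
    ("Refund requests require a receipt.", [0.8, 0.3, 0.1]),
]
```

A query vector pointing along the first axis, such as [1.0, 0.0, 0.0], ranks the two refund chunks ahead of the unrelated one.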

core/chain.py

QA chain construction:

Uses ConversationalRetrievalChain with ConversationBufferMemory for chat history
Strict prompt template to ensure answers only come from document context
Returns source documents for citation
Handles unanswerable questions gracefully
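A strict prompt of this kind might look roughly like the sketch below, which follows the "context + history + question" flow shown in the architecture diagram. The template wording, the fallback sentence, and the build_prompt helper are all assumptions made for illustration, not the prompt the project actually ships.

```python
from typing import List, Tuple

FALLBACK = "I could not find this in the document."  # assumed wording

TEMPLATE = (
    "Answer the question using ONLY the context below.\n"
    "If the context does not contain the answer, reply exactly: {fallback}\n\n"
    "Context:\n{context}\n\n"
    "Chat history:\n{history}\n\n"
    "Question: {question}\nAnswer:"
)


def build_prompt(question: str,
                 context_chunks: List[str],
                 history: List[Tuple[str, str]]) -> str:
    """Combine retrieved chunks, prior turns, and the new question into one prompt."""
    context = "\n---\n".join(context_chunks) if context_chunks else "(no context retrieved)"
    turns = "\n".join(f"User: {q}\nAssistant: {a}" for q, a in history) or "(none)"
    return TEMPLATE.format(
        fallback=FALLBACK, context=context, history=turns, question=question
    )
```

Giving the model an exact fallback sentence makes unanswerable questions easy to detect in the UI, because the response can be compared against a known string.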

app.py

Main application:

Streamlit UI and state management
Chat history tracking
Question answering workflow
Source citation display

Installation & Configuration

Prerequisites

Python 3.9 or higher
Groq API key (get one at console.groq.com)

Step 1: Clone or Download the Project

# Change into the project directory after cloning or downloading it
cd AskDocs

Step 2: Create a Virtual Environment (Recommended)

# Create virtual environment
python -m venv venv

# Activate virtual environment (Windows)
.\venv\Scripts\activate

# Activate virtual environment (macOS/Linux)
source venv/bin/activate

Step 3: Install Dependencies

pip install -r requirements.txt

Step 4: Configure Environment Variables

Create a .env file in the project root directory:

GROQ_API_KEY=your_groq_api_key_here

Important: Replace your_groq_api_key_here with your actual Groq API key.

Running the Application

Development Mode

streamlit run app.py

The application will start and automatically open in your default browser at http://localhost:8501.

Usage Instructions

Upload a PDF document using the file uploader
Wait for the "PDF loaded. Ask your question below." success message
Optional: Click "📋 Summarize Document" to get a quick overview of the document
Type your question in the chat input box
Wait for the AI to process and respond
View the answer and click "📄 View Sources" to see citations

API Documentation

This application is a Streamlit web app and does not expose a REST API. The core internal modules are documented below.

core.loader.load_and_chunk_pdf(file_path)

Loads and chunks a PDF document.

Parameters:

file_path (str): Path to the PDF file

Returns:

List[Document]: List of LangChain Document objects

Example:

from core.loader import load_and_chunk_pdf
chunks = load_and_chunk_pdf("document.pdf")

core.loader.get_summary_text(chunks, max_chunks=20)

Extracts combined text from first N chunks for document summary.

Parameters:

chunks (List[Document]): Document chunks from load_and_chunk_pdf
max_chunks (int): Maximum number of chunks to use for summary (default: 20)

Returns:

str: Combined text from selected chunks

Example:

from core.loader import load_and_chunk_pdf, get_summary_text
chunks = load_and_chunk_pdf("document.pdf")
summary_text = get_summary_text(chunks)
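For readers who want to see the behavior spelled out, the function could be approximated as below. This is a sketch under stated assumptions: Chunk is a stand-in for a LangChain Document (only its page_content attribute is used), the name get_summary_text_sketch is hypothetical, and the blank-line separator is a guess rather than the project's actual choice.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Chunk:
    """Stand-in for a LangChain Document; only page_content is used here."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def get_summary_text_sketch(chunks: List[Chunk], max_chunks: int = 20) -> str:
    """Join the text of the first max_chunks chunks (the separator is an assumption)."""
    return "\n\n".join(c.page_content for c in chunks[:max_chunks])
```

With 500-character chunks, the default of 20 chunks caps the summary input at roughly 10,000 characters, which keeps the summarization request small regardless of document length.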

core.embeddings.get_embeddings()

Returns the configured embedding model.

Returns:

HuggingFaceEmbeddings: Embedding model instance

Example:

from core.embeddings import get_embeddings
embeddings = get_embeddings()

core.vectorstore.build_vectorstore(chunks, embeddings)

Creates a FAISS vector store and returns a retriever.

Parameters:

chunks (List[Document]): Document chunks from load_and_chunk_pdf
embeddings (HuggingFaceEmbeddings): Embedding model from get_embeddings

Returns:

VectorStoreRetriever: Configured retriever object

Example:

from core.vectorstore import build_vectorstore
# chunks comes from load_and_chunk_pdf(), embeddings from get_embeddings()
retriever = build_vectorstore(chunks, embeddings)

core.chain.build_qa_chain(retriever)

Builds the ConversationalRetrievalChain with conversation memory.

Parameters:

retriever (VectorStoreRetriever): Retriever from build_vectorstore

Returns:

ConversationalRetrievalChain: Configured QA chain that accepts {"question": "..."} and maintains conversation history

Example:

from core.chain import build_qa_chain
# retriever comes from build_vectorstore()
qa_chain = build_qa_chain(retriever)
response = qa_chain.invoke({"question": "What is this document about?"})

Response Format:

{
    "question": "What is this document about?",
    "answer": "This document discusses...",
    "source_documents": [Document(...), Document(...)]
}
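The source_documents list is what drives the citation display. A minimal sketch of turning it into readable citation lines is shown below. SourceDoc is a stand-in for a LangChain Document, format_citations is a hypothetical helper, and the code assumes the page number sits under metadata["page"] as a zero-based index, which is how PyPDFLoader typically records it.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SourceDoc:
    """Stand-in for a LangChain Document returned in source_documents."""
    page_content: str
    metadata: Dict[str, object] = field(default_factory=dict)


def format_citations(source_documents: List[SourceDoc], preview_chars: int = 80) -> List[str]:
    """Build one human-readable citation line per retrieved chunk."""
    lines = []
    for doc in source_documents:
        page = doc.metadata.get("page")
        # Assumed zero-based page index, so add 1 for display.
        label = f"Page {page + 1}" if isinstance(page, int) else "Page unknown"
        # Collapse whitespace so the preview fits on one line.
        preview = " ".join(doc.page_content.split())[:preview_chars]
        lines.append(f"{label}: {preview}")
    return lines
```

In the app, lines like these would be rendered inside the "View Sources" expander beneath each answer.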

core.chain.summarize_document(text)

Generates a structured summary of document text using the LLM.

Parameters:

text (str): Document text to summarize

Returns:

str: Structured summary of the document

Example:

from core.chain import summarize_document
summary = summarize_document("This is a document about...")
print(summary)

Contributing Guidelines

We welcome contributions to AskDocs! Here's how you can help:

Development Workflow

Fork the Repository
- Create a personal fork of the project
Create a Feature Branch
```
git checkout -b feature/amazing-feature
```
Make Your Changes
- Follow the existing code style
- Add comments where necessary
- Test your changes thoroughly
Commit Your Changes
```
git commit -m "Add amazing feature"
```
Push to Your Branch
```
git push origin feature/amazing-feature
```
Open a Pull Request
- Describe your changes in detail
- Link any relevant issues

Code Style Guidelines

Follow PEP 8 guidelines
Use meaningful variable and function names
Keep functions focused and single-purpose
Add docstrings for public functions

Code of Conduct

Be respectful and inclusive
Welcome constructive feedback
Focus on what's best for the community
Show empathy towards other contributors

License

This project is licensed under the MIT License. See the LICENSE file for details.

Version History

v1.0.1 (Latest)

Bug Fixes:

Fixed NameError: 'summarize_document' is not defined in app.py
Simplified summarization workflow to use get_summary_text directly

v1.0.0

Features:

Initial release of AskDocs
PDF document upload and processing
Semantic search with FAISS
AI-powered answers using Groq Llama 3.1
Source citations
Document Summarization
Chat history
Clean Streamlit UI

Improvements:

Refactored into modular structure (config/core separation)
Added GROQ_API_KEY validation at startup
Replaced hardcoded temp.pdf with tempfile.NamedTemporaryFile to prevent concurrent access conflicts
Added graceful unanswerable question handling
Improved UI with source expanders

Bug Fixes:

Fixed indentation issues in app.py
Removed duplicate chat history display code
Added proper temp file cleanup

Known Issues

Issue	Description	Workaround
Single document only	Currently supports only one uploaded PDF at a time	Reload app to upload a new document
No persistent storage	Vector store is in-memory only	No workaround yet (future improvement)
Large PDFs	Very large PDFs may take time to process	Consider splitting large PDFs into smaller files

Future Improvements

Support for multiple file formats (DOCX, TXT, EPUB, etc.)
Persistent vector storage (ChromaDB, Pinecone, etc.)
Multiple document upload and querying
Advanced chunking strategies (semantic, hierarchical)
Custom prompt templates
Export chat history
Docker containerization
Authentication and user accounts
Better error handling and user feedback

Security Considerations

API Key Management: GROQ_API_KEY loaded from environment variable, never hardcoded
Temporary File Cleanup: Uploaded files deleted after processing using try/finally
Input Validation: File uploader restricted to PDF files only
Prompt Injection Mitigation: Strict prompt template restricts the model to the document context, which reduces but does not eliminate the risk of injected instructions
Dependencies: All dependencies listed in requirements.txt with no known vulnerabilities
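The temporary file cleanup pattern mentioned above can be sketched as follows. The process_upload helper and its handler parameter are hypothetical names used for illustration; tempfile.NamedTemporaryFile and the try/finally cleanup are the parts this README describes.

```python
import os
import tempfile
from typing import Callable, TypeVar

T = TypeVar("T")


def process_upload(data: bytes, handler: Callable[[str], T]) -> T:
    """Write uploaded bytes to a unique temp file, run handler(path), then delete it."""
    # delete=False so the file can be reopened by path after this block closes it.
    with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
        tmp.write(data)
        path = tmp.name
    try:
        return handler(path)
    finally:
        os.remove(path)  # runs even if handler raises
```

Because every upload gets its own uniquely named file, two sessions processing PDFs at the same time cannot overwrite each other's data, which was the problem with a hardcoded temp.pdf. In the app, the handler would be a function such as load_and_chunk_pdf.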

Support & Troubleshooting

Common Issues

"GROQ_API_KEY environment variable is required"

Make sure you created a .env file with your API key
Restart the Streamlit app after setting the environment variable

"No module named '...'"

Make sure you activated your virtual environment
Run pip install -r requirements.txt

PDF won't upload or process

Make sure the file is a valid PDF
Try a different PDF file to rule out corruption

Getting Help

If you encounter issues:

Check the Known Issues section above
Review the Streamlit terminal output for error messages
Open an issue in the project repository

Made with ❤️ using Python, Streamlit, and LangChain
