A retrieval-augmented generation (RAG) application that enables natural language interaction with your document collections. Built using Streamlit, LangChain, and vector embeddings.
DocuChat processes your documents and creates a searchable knowledge base. Users can ask questions in natural language and receive accurate answers grounded in the source material, complete with citations.
- Upload and process multiple document formats (PDF, TXT, DOCX)
- Semantic search using sentence transformers
- Context-aware responses with conversation history
- Source attribution for transparency
- Support for multiple LLM providers
- Simple web interface
Requires Python 3.8 or higher.
Clone the repository:

```shell
git clone https://github.com/yourusername/docuchat.git
cd docuchat
```

Create and activate a virtual environment:

```shell
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
```

Install dependencies:

```shell
pip install -r requirements.txt
```

Create a `.env` file in the root directory with your API credentials:

```
OPENAI_API_KEY=your_key_here
```
Alternatively, you can use Google's Gemini or Hugging Face models by providing the appropriate API keys.
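In application code, a `.env` file like this is typically consumed with the `python-dotenv` package. As a self-contained illustration of what that loading step does, here is a minimal stand-in reader (`load_env` is a hypothetical helper, not part of this repository):

```python
import os
from pathlib import Path

def load_env(path=".env"):
    """Minimal .env reader (a stand-in for python-dotenv's load_dotenv).

    Reads KEY=value lines and exports them into the process environment,
    without overwriting variables that are already set.
    """
    env_file = Path(path)
    if not env_file.exists():
        return
    for line in env_file.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip())
```

In practice, prefer the real `python-dotenv` library; this sketch only shows the mechanics.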
Place your documents in the knowledge_base/ directory. Supported formats include PDF, plain text, and DOCX files.
Start the application:
```shell
streamlit run app.py
```

Navigate to http://localhost:8501 in your browser. Click the load button in the sidebar to index your documents, then begin asking questions.
```
docuchat/
├── app.py                 # Main application file
├── src/
│   ├── document_loader.py # Document processing
│   ├── embeddings.py      # Vector store management
│   ├── retriever.py       # Context retrieval
│   └── chat_handler.py    # Response generation
├── knowledge_base/        # Document storage
├── vectorstore/           # Vector database
├── requirements.txt
└── .env
```
The system uses a multi-stage pipeline:
- Documents are split into manageable chunks with overlap to preserve context
- Text chunks are embedded using sentence-transformers
- Embeddings are stored in a ChromaDB vector database
- User queries are embedded and matched against the database
- Relevant contexts are retrieved and passed to the LLM
- The LLM generates responses based solely on retrieved information
You can modify various parameters in the source files:

- Adjust chunk size and overlap in `document_loader.py`
- Change the number of retrieved documents in `retriever.py`
- Switch between different embedding models in `embeddings.py`
- Modify the system prompt in `chat_handler.py`
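In practice these knobs usually reduce to a handful of module-level constants. The names and values below are illustrative only; the actual identifiers in this repository may differ:

```python
# Hypothetical tuning constants; check the actual source files for real names.
CHUNK_SIZE = 1000        # characters per chunk (document_loader.py)
CHUNK_OVERLAP = 200      # characters shared between adjacent chunks
TOP_K = 4                # number of chunks retrieved per query (retriever.py)
EMBEDDING_MODEL = "all-MiniLM-L6-v2"  # a common sentence-transformers choice
```

A useful rule of thumb is to keep the overlap well below the chunk size, so adjacent chunks share context without duplicating most of their content.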
The application can be deployed to Streamlit Cloud or any platform supporting Python web applications. Remember to configure secrets/environment variables appropriately for production use.
Core libraries include:
- Streamlit for the web interface
- LangChain for RAG orchestration
- ChromaDB for vector storage
- Sentence Transformers for embeddings
- PyPDF/Docx2txt for document parsing
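A `requirements.txt` covering these libraries might look like the following (package names only; the repository's actual file may differ and should pin versions for reproducible installs):

```
streamlit
langchain
chromadb
sentence-transformers
pypdf
docx2txt
python-dotenv
```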
- The first document indexing may take several minutes depending on collection size
- Vector store is persisted locally and can be rebuilt by deleting the `vectorstore/` directory
- Responses are limited to information contained in your documents
- Document quality matters: clean, well-formatted text yields better retrieval results
MIT License
Contributions welcome. Please open an issue to discuss proposed changes before submitting pull requests.