A retrieval-augmented generation (RAG) application that enables natural language interaction with your document collections. Built using Streamlit, LangChain, and vector embeddings.
DocuChat processes your documents and creates a searchable knowledge base. Users can ask questions in natural language and receive accurate answers grounded in the source material, complete with citations.
- Upload and process multiple document formats (PDF, TXT, DOCX)
- Semantic search using sentence transformers
- Context-aware responses with conversation history
- Source attribution for transparency
- Support for multiple LLM providers
- Simple web interface
Requires Python 3.8 or higher.
Clone the repository:

```shell
git clone https://github.com/yourusername/docuchat.git
cd docuchat
```

Create and activate a virtual environment:

```shell
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
```

Install dependencies:

```shell
pip install -r requirements.txt
```

Create a `.env` file in the root directory with your API credentials:

```
OPENAI_API_KEY=your_key_here
```
Alternatively, you can use Google's Gemini or Hugging Face models by providing the appropriate API keys.
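In application code, a `.env` file like this is typically consumed with the `python-dotenv` package. As a self-contained illustration of what that loading step does, here is a minimal stand-in reader (`load_env` is a hypothetical helper, not part of this repository):

```python
import os
from pathlib import Path

def load_env(path=".env"):
    """Minimal .env reader (a stand-in for python-dotenv's load_dotenv).

    Reads KEY=value lines and exports them into the process environment,
    without overwriting variables that are already set.
    """
    env_file = Path(path)
    if not env_file.exists():
        return
    for line in env_file.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip())
```

In practice, prefer the real `python-dotenv` library; this sketch only shows the mechanics.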
Place your documents in the knowledge_base/ directory. Supported formats include PDF, plain text, and DOCX files.
Start the application:
```shell
streamlit run app.py
```

Navigate to http://localhost:8501 in your browser. Click the load button in the sidebar to index your documents, then begin asking questions.
```
docuchat/
├── app.py                 # Main application file
├── src/
│   ├── document_loader.py # Document processing
│   ├── embeddings.py      # Vector store management
│   ├── retriever.py       # Context retrieval
│   └── chat_handler.py    # Response generation
├── knowledge_base/        # Document storage
├── vectorstore/           # Vector database
├── requirements.txt
└── .env
```
The system uses a multi-stage pipeline:
- Documents are split into manageable chunks with overlap to preserve context
- Text chunks are embedded using sentence-transformers
- Embeddings are stored in a ChromaDB vector database
- User queries are embedded and matched against the database
- Relevant contexts are retrieved and passed to the LLM
- The LLM generates responses based solely on retrieved information
You can modify various parameters in the source files:

- Adjust chunk size and overlap in `document_loader.py`
- Change the number of retrieved documents in `retriever.py`
- Switch between different embedding models in `embeddings.py`
- Modify the system prompt in `chat_handler.py`
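In practice these knobs usually reduce to a handful of module-level constants. The names and values below are illustrative only; the actual identifiers in this repository may differ:

```python
# Hypothetical tuning constants; check the actual source files for real names.
CHUNK_SIZE = 1000        # characters per chunk (document_loader.py)
CHUNK_OVERLAP = 200      # characters shared between adjacent chunks
TOP_K = 4                # number of chunks retrieved per query (retriever.py)
EMBEDDING_MODEL = "all-MiniLM-L6-v2"  # a common sentence-transformers choice
```

A useful rule of thumb is to keep the overlap well below the chunk size, so adjacent chunks share context without duplicating most of their content.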
The application can be deployed to Streamlit Cloud or any platform supporting Python web applications. Remember to configure secrets/environment variables appropriately for production use.
Core libraries include:
- Streamlit for the web interface
- LangChain for RAG orchestration
- ChromaDB for vector storage
- Sentence Transformers for embeddings
- PyPDF/Docx2txt for document parsing
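A `requirements.txt` covering these libraries might look like the following (package names only; the repository's actual file may differ and should pin versions for reproducible installs):

```
streamlit
langchain
chromadb
sentence-transformers
pypdf
docx2txt
python-dotenv
```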
- The first document indexing may take several minutes depending on collection size
- Vector store is persisted locally and can be rebuilt by deleting the `vectorstore/` directory
- Responses are limited to information contained in your documents
- Document quality matters: clean, well-formatted text yields better retrieval results
MIT License
Contributions welcome. Please open an issue to discuss proposed changes before submitting pull requests.