
DocuChat - Document-Based Question Answering System

A retrieval-augmented generation (RAG) application that enables natural language interaction with your document collections. Built using Streamlit, LangChain, and vector embeddings.

Overview

DocuChat processes your documents and creates a searchable knowledge base. Users can ask questions in natural language and receive accurate answers grounded in the source material, complete with citations.

Features

  • Upload and process multiple document formats (PDF, TXT, DOCX)
  • Semantic search using sentence transformers
  • Context-aware responses with conversation history
  • Source attribution for transparency
  • Support for multiple LLM providers
  • Simple web interface

Installation

Requires Python 3.8 or higher.

Clone the repository:

git clone https://github.com/yourusername/docuchat.git
cd docuchat

Create and activate a virtual environment:

python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

Install dependencies:

pip install -r requirements.txt

Configuration

Create a .env file in the root directory with your API credentials:

OPENAI_API_KEY=your_key_here

Alternatively, you can use Google's Gemini or Hugging Face models by providing the appropriate API keys.
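In practice the app would typically load these values with python-dotenv's `load_dotenv()`; the snippet below is a stdlib-only sketch of the same idea (the `.env.example` filename and `load_env` helper are illustrative, not part of the repository):

```python
import os

def load_env(path=".env"):
    """Parse KEY=VALUE lines from a .env file, skipping blanks and comments."""
    values = {}
    with open(path) as fh:
        for raw in fh:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip()
    for key, value in values.items():
        os.environ.setdefault(key, value)  # real environment variables take precedence
    return values

# Demo: write a sample file and load it
with open(".env.example", "w") as fh:
    fh.write("# API credentials\nOPENAI_API_KEY=your_key_here\n")
cfg = load_env(".env.example")
```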

Usage

Place your documents in the knowledge_base/ directory. Supported formats include PDF, plain text, and DOCX files.

Start the application:

streamlit run app.py

Navigate to http://localhost:8501 in your browser. Click the load button in the sidebar to index your documents, then begin asking questions.

Project Structure

docuchat/
├── app.py                    # Main application file
├── src/
│   ├── document_loader.py    # Document processing
│   ├── embeddings.py         # Vector store management
│   ├── retriever.py          # Context retrieval
│   └── chat_handler.py       # Response generation
├── knowledge_base/           # Document storage
├── vectorstore/              # Vector database
├── requirements.txt
└── .env

Technical Details

The system uses a multi-stage pipeline:

  1. Documents are split into manageable chunks with overlap to preserve context
  2. Text chunks are embedded using sentence-transformers
  3. Embeddings are stored in a ChromaDB vector database
  4. User queries are embedded and matched against the database
  5. Relevant contexts are retrieved and passed to the LLM
  6. The LLM generates responses based solely on retrieved information
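The pipeline above can be sketched in plain Python. This is illustrative only: the bag-of-words `Counter` stands in for the real sentence-transformer embeddings, and the brute-force similarity search stands in for ChromaDB; the sample text, function names, and parameter values are assumptions, not the repository's actual implementation.

```python
import math
import re
from collections import Counter

def chunk_text(text, chunk_size=200, overlap=50):
    """Step 1: fixed-size character chunks; the overlap preserves context
    across chunk boundaries."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

def embed(text):
    """Steps 2 and 4 stand-in: a bag-of-words vector instead of a real
    sentence-transformer embedding."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Similarity between two sparse vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k=2):
    """Step 5: rank chunks by similarity to the query and keep the top k.
    DocuChat delegates this to ChromaDB; the LLM then answers from only
    these retrieved chunks (step 6)."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

text = ("DocuChat builds a searchable knowledge base from your documents. "
        "Embeddings are stored in a ChromaDB vector database. "
        "The LLM answers using only the retrieved context.")
chunks = chunk_text(text, chunk_size=70, overlap=10)
top = retrieve("where are the embeddings stored", chunks, k=1)
```

The overlap in step 1 is why a sentence split across two chunks is still retrievable as a whole from at least one of them.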

Customization

You can modify various parameters in the source files:

  • Adjust chunk size and overlap in document_loader.py
  • Change the number of retrieved documents in retriever.py
  • Switch between different embedding models in embeddings.py
  • Modify the system prompt in chat_handler.py

Deployment

The application can be deployed to Streamlit Cloud or any platform supporting Python web applications. Remember to configure secrets/environment variables appropriately for production use.

Dependencies

Core libraries include:

  • Streamlit for the web interface
  • LangChain for RAG orchestration
  • ChromaDB for vector storage
  • Sentence Transformers for embeddings
  • PyPDF/Docx2txt for document parsing
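A requirements.txt matching that list might look like the following. The exact package names (notably python-dotenv for .env loading) and the absence of version pins are assumptions; defer to the file shipped in the repository:

```text
streamlit
langchain
chromadb
sentence-transformers
pypdf
docx2txt
python-dotenv
```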

Notes

  • The first document indexing may take several minutes depending on collection size
  • The vector store is persisted locally; delete the vectorstore/ directory and re-index to rebuild it from scratch
  • Responses are limited to information contained in your documents
  • Document quality matters: clean, well-formatted source files yield the most accurate answers

License

MIT License

Contributing

Contributions welcome. Please open an issue to discuss proposed changes before submitting pull requests.
