A powerful Streamlit-based chatbot that enables interactive conversations with multiple PDF documents using advanced AI technologies. This application extracts text from uploaded PDFs, processes it into manageable chunks, and uses vector embeddings and conversational AI to answer questions about the content.
- Multi-PDF Support: Upload and process multiple PDF files simultaneously
- AI-Powered Chat: Ask questions about your documents and get intelligent responses
- Text Extraction: Automatically extracts text from PDF pages
- Vector Search: Uses FAISS vector store for efficient document retrieval
- Conversational Memory: Maintains context throughout the conversation
- User-Friendly Interface: Clean Streamlit UI with chat-like messaging
- Python 3.8 or higher
- OpenAI API key (required for embeddings and chat functionality)
- Optional: HuggingFace API token (for alternative embedding models)
- Clone the repository:
  `git clone <your-repository-url>`
  `cd pdfreader-bot`
- Create a virtual environment (recommended):
  `python -m venv venv`
  `source venv/bin/activate` (on Windows: `venv\Scripts\activate`)
- Install dependencies:
  `pip install streamlit python-dotenv PyPDF2 langchain langchain-openai faiss-cpu`
- Set up environment variables: copy the provided `.env` file or create a new one, then add your API keys:
  `OPENAI_API_KEY=your_openai_api_key_here`
  `HUGGINGFACEHUB_API_TOKEN=your_huggingface_token_here` (optional)
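The app reads these variables with python-dotenv's `load_dotenv()`. The stand-alone sketch below shows roughly what that parsing amounts to; `parse_env_text` is an illustrative helper, not part of python-dotenv or this app, and real `load_dotenv()` additionally handles quoting and interpolation:

```python
# Minimal sketch of .env parsing, using only the standard library.
import os

def parse_env_text(text):
    """Parse KEY=VALUE lines into a dict, skipping blanks and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        # Drop an inline "# comment" suffix and surrounding whitespace.
        env[key.strip()] = value.split("#", 1)[0].strip()
    return env

env = parse_env_text(
    "OPENAI_API_KEY=your_openai_api_key_here\n"
    "HUGGINGFACEHUB_API_TOKEN=your_huggingface_token_here  # Optional\n"
)
os.environ.update(env)  # load_dotenv() does roughly this
print(sorted(env))  # → ['HUGGINGFACEHUB_API_TOKEN', 'OPENAI_API_KEY']
```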
- Start the application:
  `streamlit run app.py`
- Interact with the app:
  - Open your browser to the provided local URL (usually http://localhost:8501)
  - In the sidebar, upload one or more PDF files
  - Click the "Process" button to extract and index the text
  - Once processing is complete, start asking questions in the chat input
  - The AI will provide answers based on the content of your uploaded PDFs
- Text Extraction: Uses PyPDF2 to extract text from each page of uploaded PDFs
- Text Chunking: Splits the extracted text into smaller chunks using LangChain's CharacterTextSplitter
- Embeddings: Creates vector embeddings using OpenAI's embedding model
- Vector Store: Stores embeddings in a FAISS vector database for efficient similarity search
- Conversational Chain: Uses LangChain's ConversationalRetrievalChain with ChatOpenAI for question answering
- Memory: Maintains conversation history for context-aware responses
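Conceptually, the embedding, vector-store, and retrieval steps above reduce to "embed each chunk, store the vectors, retrieve the chunk closest to the query." The toy sketch below illustrates that loop with deliberately simple stand-ins: bag-of-words counts in place of OpenAI embeddings, and a linear cosine-similarity scan in place of a FAISS index (the chunks and query are made-up examples):

```python
# Toy illustration of the embed → store → retrieve loop.
import math
from collections import Counter

def embed(text):
    """Crude stand-in 'embedding': a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

chunks = [
    "Invoices are due within 30 days of receipt.",
    "The warranty covers manufacturing defects for two years.",
]
store = [(embed(c), c) for c in chunks]  # the "vector store"

query = "how long is the warranty"
best = max(store, key=lambda pair: cosine(embed(query), pair[0]))
print(best[1])  # → "The warranty covers manufacturing defects for two years."
```

The retrieved chunk is then passed to the chat model as context, which is what `ConversationalRetrievalChain` orchestrates in the real app.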
The application uses the following default settings (configurable in app.py):
- Text chunk size: 1000 characters
- Chunk overlap: 200 characters
- Embedding model: OpenAI Embeddings (can be switched to HuggingFace Instructor embeddings)
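To see how chunk size and overlap interact, here is a simplified sliding-window splitter using the same defaults. LangChain's `CharacterTextSplitter` additionally splits on a separator before merging, so real chunk boundaries differ; `split_text` here is an illustrative stand-in:

```python
# Sketch of fixed-size chunking with overlap (defaults: 1000 chars, 200 overlap).
def split_text(text, chunk_size=1000, overlap=200):
    step = chunk_size - overlap  # each chunk starts 800 chars after the last
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]

doc = "x" * 2500
chunks = split_text(doc)
print([len(c) for c in chunks])  # → [1000, 1000, 900]
```

The 200-character overlap means each chunk repeats the tail of the previous one, so a sentence falling on a chunk boundary still appears intact in at least one chunk.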
- API Key Issues: Ensure your OpenAI API key is correctly set in the `.env` file
- PDF Processing Errors: Make sure your PDFs contain extractable text (not just images)
- Memory Issues: For large PDFs, consider increasing system memory or reducing chunk size
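A quick way to check the "extractable text" condition is to look at what extraction returns per page: a scanned, image-only PDF typically yields empty or whitespace-only strings. The sketch below uses plain strings in place of PyPDF2's `page.extract_text()` results:

```python
# Hedged sketch: detect PDFs with no extractable text, using stand-in strings
# where the real app would pass each page's extract_text() result.
def has_extractable_text(page_texts):
    """True if at least one page yielded non-whitespace text."""
    return any(t.strip() for t in page_texts)

print(has_extractable_text(["", "  \n"]))       # → False (likely scanned images)
print(has_extractable_text(["Chapter 1", ""]))  # → True
```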
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add some amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.