patrickAdegbesan/PDF_READER

PDF Reader Bot

A Streamlit-based chatbot for interactive conversations with multiple PDF documents. The application extracts text from uploaded PDFs, splits it into chunks, indexes the chunks as vector embeddings, and uses a conversational language model to answer questions about the content.

Features

  • Multi-PDF Support: Upload and process multiple PDF files simultaneously
  • AI-Powered Chat: Ask questions about your documents and get answers based on their content
  • Text Extraction: Automatically extracts text from PDF pages
  • Vector Search: Uses FAISS vector store for efficient document retrieval
  • Conversational Memory: Maintains context throughout the conversation
  • User-Friendly Interface: Clean Streamlit UI with chat-like messaging

Prerequisites

  • Python 3.8 or higher
  • OpenAI API key (required for embeddings and chat functionality)
  • Optional: HuggingFace API token (for alternative embedding models)

Installation

  1. Clone the repository:

    git clone <your-repository-url>
    cd PDF_READER
  2. Create a virtual environment (recommended):

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:

    pip install streamlit python-dotenv PyPDF2 langchain langchain-openai faiss-cpu
  4. Set up environment variables:

    • Copy the provided .env file or create a new one
    • Add your OpenAI API key:
      OPENAI_API_KEY=your_openai_api_key_here
      HUGGINGFACEHUB_API_TOKEN=your_huggingface_token_here  # Optional
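
A missing or placeholder key is easiest to catch at startup. The following is a minimal sketch, assuming the app has already loaded the .env file into the process environment (for example with python-dotenv's load_dotenv()). The helper name require_api_key is hypothetical and is not taken from app.py.

```python
import os


def require_api_key(name="OPENAI_API_KEY", env=None):
    """Return the key stored under `name`, or raise if it is missing.

    `env` defaults to the process environment; passing a dict makes the
    check easy to exercise without touching real environment variables.
    """
    env = os.environ if env is None else env
    value = (env.get(name) or "").strip()
    # Treat the README placeholder ("your_openai_api_key_here") as unset.
    if not value or value.startswith("your_"):
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value
```

Calling require_api_key() before building the embeddings turns a confusing authentication failure later on into an immediate, readable error.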
      

Usage

  1. Start the application:

    streamlit run app.py
  2. Interact with the app:

    • Open your browser to the provided local URL (usually http://localhost:8501)
    • In the sidebar, upload one or more PDF files
    • Click the "Process" button to extract and index the text
    • Once processing is complete, start asking questions in the chat input
    • The AI will provide answers based on the content of your uploaded PDFs

How It Works

  1. Text Extraction: Uses PyPDF2 to extract text from each page of uploaded PDFs
  2. Text Chunking: Splits the extracted text into smaller chunks using LangChain's CharacterTextSplitter
  3. Embeddings: Creates vector embeddings using OpenAI's embedding model
  4. Vector Store: Stores embeddings in a FAISS vector database for efficient similarity search
  5. Conversational Chain: Uses LangChain's ConversationalRetrievalChain with ChatOpenAI for question answering
  6. Memory: Maintains conversation history for context-aware responses
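
The retrieval and memory steps above can be illustrated with a toy, standard-library-only sketch. It is not the app's implementation: word counts stand in for OpenAI embeddings, a linear scan stands in for the FAISS index, and the function names (embed, cosine, top_k, build_prompt) are hypothetical. It only shows the idea of ranking chunks by similarity to the question and passing the best matches, together with the chat history, to the model.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_k(question, chunks, k=2):
    """Return the k chunks most similar to the question (linear scan, no index)."""
    query = embed(question)
    ranked = sorted(chunks, key=lambda chunk: cosine(query, embed(chunk)), reverse=True)
    return ranked[:k]


def build_prompt(question, chunks, history, k=2):
    """Combine retrieved context and earlier (question, answer) turns into one prompt."""
    context = "\n".join(top_k(question, chunks, k))
    past = "\n".join(f"User: {q}\nAssistant: {a}" for q, a in history)
    return f"Context:\n{context}\n\nHistory:\n{past}\n\nQuestion: {question}"
```

In the real app, the embedding model and FAISS replace embed and top_k, and the conversational chain handles the prompt assembly and history.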

Configuration

The application uses the following default settings (configurable in app.py):

  • Text chunk size: 1000 characters
  • Chunk overlap: 200 characters
  • Embedding model: OpenAI Embeddings (can be switched to HuggingFace Instructor embeddings)
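
To see what the chunk size and overlap settings mean in practice, here is a simplified sketch using fixed-size character windows. It is an illustration only: LangChain's CharacterTextSplitter splits on a separator before merging pieces, so its chunk boundaries will differ. The function name chunk_text is hypothetical.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters, so a sentence cut
    at one boundary still appears whole in the neighbouring chunk.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

With the defaults, each new chunk starts 800 characters after the previous one, so a 2,500-character document yields three chunks.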

Troubleshooting

  • API Key Issues: Ensure your OpenAI API key is correctly set in the .env file
  • PDF Processing Errors: Make sure your PDFs contain extractable text (not just images)
  • Memory Issues: For large PDFs, consider processing fewer files at a time or increasing system memory; note that a smaller chunk size produces more chunks to embed and store
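
Scanned PDFs are a common cause of empty answers, because image-only pages yield no text. As a rough diagnostic, the sketch below flags pages whose extracted text is missing or very short. It assumes the caller has already collected one string per page (for example from PyPDF2's extract_text()); the helper name pages_without_text and the 20-character threshold are assumptions, not part of app.py.

```python
def pages_without_text(page_texts, min_chars=20):
    """Return 1-based page numbers whose extracted text is missing or very short."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        # Extraction may yield None or whitespace for image-only pages.
        if len((text or "").strip()) < min_chars
    ]
```

If most pages are flagged, the PDF likely needs OCR before the app can index it.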

Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments
