PDF RAG System

A Retrieval-Augmented Generation (RAG) system for answering questions about PDF documents, including password-protected PDFs.

Features

Extract text from PDF documents (including password-protected PDFs)
Process and chunk text for efficient retrieval
Create and persist vector embeddings using OpenAI embeddings
Answer questions about the PDF content using a retrieval-based approach
Interactive chat interface

Installation

Clone this repository:

git clone <repository-url>
cd rag_practive

Install the required dependencies:

pip install -r requirements.txt

Set up your OpenAI API key:

# Create a .env file
cp .env.example .env
# Edit the .env file to add your OpenAI API key
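Before running anything, it can help to confirm that the key is actually visible. The sketch below is one way to check, using only the standard library. It assumes the variable is named OPENAI_API_KEY (the name the OpenAI client reads by default; the contents of .env.example are not shown here) and that .env holds plain KEY=VALUE lines. The helper names load_env_file and has_api_key are hypothetical and not part of this project.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines from a .env file into a dict."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def has_api_key(path=".env", var_name="OPENAI_API_KEY"):
    """Return True if var_name is set in the environment or in the .env file.

    OPENAI_API_KEY is an assumed variable name, not confirmed by this README.
    """
    if os.environ.get(var_name):
        return True
    return bool(load_env_file(path).get(var_name))


if __name__ == "__main__":
    print("API key found" if has_api_key() else "API key missing")
```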

Usage

Process a PDF and Answer Questions

python -m pdf_rag.main --pdf path/to/your/pdf --password your-password --process --persist_dir ./data/vectorstore --question "Your question about the PDF"

Interactive Chat Interface

python -m pdf_rag.chat --pdf path/to/your/pdf --password your-password --persist_dir ./data/vectorstore

Web Interface

streamlit run app.py

Command-line Arguments

--pdf: Path to the PDF file (required)
--password: Password for the PDF file (if protected)
--process: Process the PDF and create a vector store
--persist_dir: Directory to persist the vector store
--question: Question to ask about the PDF (used by pdf_rag.main only)
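For readers who want to see how these flags fit together, here is a minimal argparse sketch that accepts the same arguments. It is an illustration only: the real parser in pdf_rag may differ, and the defaults and the build_parser name are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the flags listed above (defaults are assumed)."""
    parser = argparse.ArgumentParser(
        description="Ask questions about a PDF document."
    )
    parser.add_argument("--pdf", required=True, help="Path to the PDF file")
    parser.add_argument(
        "--password", default=None, help="Password for the PDF file (if protected)"
    )
    parser.add_argument(
        "--process",
        action="store_true",
        help="Process the PDF and create a vector store",
    )
    parser.add_argument(
        "--persist_dir", default=None, help="Directory to persist the vector store"
    )
    parser.add_argument(
        "--question", default=None, help="Question to ask about the PDF"
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Because --process is a store_true flag, it takes no value: include it to rebuild the vector store, omit it to reuse the one already in --persist_dir.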

Example

# Process a PDF and create a vector store
python -m pdf_rag.main --pdf ./data/document.pdf --password HIMA1010 --process --persist_dir ./data/vectorstore

# Ask a question about the PDF
python -m pdf_rag.main --pdf ./data/document.pdf --password HIMA1010 --persist_dir ./data/vectorstore --question "What is this document about?"

# Start an interactive chat session
python -m pdf_rag.chat --pdf ./data/document.pdf --password HIMA1010 --persist_dir ./data/vectorstore

How It Works

1. PDF Text Extraction: The system extracts text from the PDF document, handling password protection if necessary.
2. Text Chunking: The extracted text is split into manageable chunks with some overlap to maintain context.
3. Vector Embedding: Each text chunk is converted into a vector embedding using OpenAI's embedding model.
4. Vector Storage: The embeddings are stored in a Chroma vector database for efficient retrieval.
5. Question Answering: When a question is asked, the system:
   - Converts the question into an embedding
   - Retrieves the most relevant text chunks from the vector store
   - Sends the question and relevant context to a language model (GPT-4)
   - Returns the generated answer
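The chunking and retrieval steps above can be sketched in plain Python. This is not the project's code: the real system uses OpenAI embeddings and a Chroma vector store, while the sketch below swaps in hand-written cosine similarity over toy vectors so it runs without an API key. The function names and the chunk_size and overlap defaults are assumptions chosen for illustration.

```python
import math


def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into chunks of at most chunk_size characters.

    Each chunk starts (chunk_size - overlap) characters after the previous
    one, so neighbouring chunks share `overlap` characters of context.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=3):
    """Return the indices of the k chunk vectors most similar to the query."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine_similarity(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    print(chunk_text("abcdefghij", chunk_size=4, overlap=2))
    # Toy 2-dimensional "embeddings"; real embeddings have many more dimensions.
    print(top_k([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]], k=2))
```

In the actual pipeline, the chunks returned by the retrieval step would be joined into a context string and sent to the language model together with the question.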

Requirements

Python 3.8+
OpenAI API key
Dependencies listed in requirements.txt
