# Advanced RAG for Local PDFs using Gemini & GGUF Models

This notebook demonstrates an advanced RAG (Retrieval Augmented Generation) pipeline for local PDF documents, including those with tables. It uses a local GGUF model for inference and the Gemini API for intelligent, context-aware document processing.

**Key Features:**
- **PDF Processing:** Handles complex PDFs with text and tables using `PyMuPDF` (tables are converted to Markdown).
- **Context-Aware Splitting:** Uses the Gemini API to split the document into semantically coherent chunks, which is superior to fixed-size splitting.
- **High-Quality Embeddings:** Utilizes Google's `text-embedding-004` model, which aligns well with the Gemini splitter.
- **Local LLM:** Runs inference on a local GGUF model (like Qwen, Llama, Phi-3) via `LlamaCpp` for privacy and offline use.
- **CPU-Optimized:** Designed to run entirely on CPU with ~16GB RAM.
- **Persistent Vector DB:** Saves and loads the vector database to accelerate subsequent sessions.

In [1]:
# Step 1: Install all required packages
# This cell handles the installation of all necessary libraries, including the CPU-only version of PyTorch.
print("🔧 Installing necessary packages...")

# First, install CPU-only PyTorch to ensure no heavy CUDA dependencies are pulled in.
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu --no-cache-dir

# Then, install the core components for PDF processing, embeddings, vector stores, and the Gemini API.
!pip install langchain langchain-community sentence-transformers faiss-cpu PyMuPDF google-generativeai python-dotenv langchain-google-genai chromadb

# Separate installation for llama-cpp-python, as it can be complex.
# This tries a direct install first, which often works by pulling a pre-compiled wheel.
try:
    import llama_cpp
    print("✅ llama-cpp-python is already installed.")
except ImportError:
    print("Could not find llama-cpp-python, attempting to install...")
    # Note: On some systems, this might require build tools if a matching wheel isn't found.
    !pip install llama-cpp-python

print("✅ All dependencies should now be installed.")

🔧 Installing necessary packages...
Looking in indexes: https://download.pytorch.org/whl/cpu
Collecting PyMuPDF
  Downloading pymupdf-1.26.3-cp39-abi3-manylinux_2_28_x86_64.whl.metadata (3.4 kB)
Collecting google-generativeai
  Using cached google_generativeai-0.8.5-py3-none-any.whl.metadata (3.9 kB)
Collecting langchain-google-genai
  Downloading langchain_google_genai-2.1.8-py3-none-any.whl.metadata (7.0 kB)
Collecting chromadb
  Downloading chromadb-1.0.15-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.0 kB)
Collecting google-ai-generativelanguage==0.6.15 (from google-generativeai)
  Using cached google_ai_generativelanguage-0.6.15-py3-none-any.whl.metadata (5.7 kB)
Collecting google-api-core (from google-generativeai)
  Using cached google_api_core-2.25.1-py3-none-any.whl.metadata (3.0 kB)
Collecting google-api-python-client (from google-generativeai)
  Downloading google_api_python_client-2.177.0-py3-none-any.whl.metadata (7.0 kB)
Collecting google-auth>=2.15.

In [2]:
# Step 2: Import all packages and set up the Google API Key
import os
import fitz  # PyMuPDF
import google.generativeai as genai
from getpass import getpass
from dotenv import load_dotenv

from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.llms import LlamaCpp
from langchain.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

print("🔑 Setting up Google API Key...")
load_dotenv() # Load variables from a .env file if it exists
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")

if not GOOGLE_API_KEY:
    print("Google API Key not found in .env file.")
    try:
        GOOGLE_API_KEY = getpass("Please enter your Google API Key: ")
    except Exception as e:
        print(f"Could not read API key: {e}")
        GOOGLE_API_KEY = None

if GOOGLE_API_KEY:
    try:
        genai.configure(api_key=GOOGLE_API_KEY)
        print("✅ Google API Key configured successfully.")
    except Exception as e:
        print(f"❌ Failed to configure Google API: {e}")
else:
    print("❌ No Google API Key provided. PDF splitting will not work.")

🔑 Setting up Google API Key...
✅ Google API Key configured successfully.


## Step 3: PDF Processing with Gemini

Here, we define two helper functions:
1.  `extract_text_and_tables_from_pdf`: Uses `PyMuPDF` to read text and convert any tables into a Markdown format that the LLM can understand.
2.  `split_text_with_gemini`: Sends the extracted text (including Markdown tables) to the Gemini API to be split into semantically coherent chunks.

In [3]:
# Helper function to extract text and tables from a PDF
def extract_text_and_tables_from_pdf(pdf_path: str) -> str:
    """Extracts text and tables (as Markdown) from a PDF file using PyMuPDF."""
    print(f"📖 Reading text and tables from '{pdf_path}'...")
    if not os.path.exists(pdf_path):
        print(f"Error: File not found at {pdf_path}")
        return ""
    try:
        doc = fitz.open(pdf_path)
        full_content = ""
        for page_num, page in enumerate(doc):
            # Find tables on the page
            tables = page.find_tables()
            if tables:
                print(f"  > Found {len(tables.tables)} table(s) on page {page_num + 1}")
                for table in tables:
                    # Add a marker for the table and convert it to Markdown
                    table_data = table.extract()
                    if not table_data or not table_data[0]: continue
                    markdown_table = "| " + " | ".join(map(str, table_data[0])) + " |\n"
                    markdown_table += "| " + " | ".join(["---"] * len(table_data[0])) + " |\n"
                    for row in table_data[1:]:
                        markdown_table += "| " + " | ".join(map(str, row)) + " |\n"
                    full_content += f"\n--- TABLE START ---\n{markdown_table}--- TABLE END ---\n\n"

            # Add the plain text of the page
            full_content += page.get_text("text") + "\n"

        print("✅ Extraction of text and tables completed.")
        return full_content
    except Exception as e:
        print(f"Error processing PDF with PyMuPDF: {e}")
        return ""

# Helper function for context-aware splitting using Gemini
def split_text_with_gemini(text_to_split: str, model_name="gemini-1.5-flash") -> list[str]:
    """Uses Gemini to split text into semantically coherent sections."""
    if not text_to_split or not GOOGLE_API_KEY:
        print("❌ Cannot split text: No text provided or Google API Key is missing.")
        return []
    
    print(f"🧠 Using {model_name} for context-aware splitting...")
    prompt = f"""
    Your task is to reformat the following document text, which may include text and Markdown tables.
    Group related sentences and ideas into coherent paragraphs or sections.
    Each resulting section should represent a distinct topic or a continuous thought.
    Do NOT summarize, alter, or interpret the content, only restructure it based on semantic context.
    Preserve the Markdown tables as they are within their relevant context.
    Separate the resulting sections with a unique delimiter: '---CHUNK_SEPARATOR---'.

    Original Text:
    ---
    {text_to_split}
    ---
    """
    try:
        model = genai.GenerativeModel(model_name)
        response = model.generate_content(prompt)
        chunks = response.text.split('---CHUNK_SEPARATOR---')
        cleaned_chunks = [chunk.strip() for chunk in chunks if chunk.strip()]
        print(f"✅ Text split into {len(cleaned_chunks)} semantic chunks.")
        return cleaned_chunks
    except Exception as e:
        print(f"❌ Error during communication with Gemini API: {e}")
        return []

In [7]:
# Du benötigst diesen Import, falls er nicht schon global vorhanden ist
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Execute the PDF processing workflow with the improved splitting strategy

PDF_PATH = "simatic_S7_300.pdf"  # <<< IMPORTANT: SET THE PATH TO YOUR PDF FILE HERE

document_text = extract_text_and_tables_from_pdf(PDF_PATH)
context_aware_chunks = []

if document_text:
    print("\n--- Starting Improved Splitting Process ---")
    
    # 1. Grobes Vor-Chunking, um das API-Input-Limit zu umgehen
    # Wir erstellen größere Chunks, die sicher in das Kontextfenster von Gemini passen.
    # Ein Wert von 10.000 bis 15.000 ist oft ein guter Kompromiss.
    pre_splitter = RecursiveCharacterTextSplitter(
        chunk_size=12000, 
        chunk_overlap=200
    )
    pre_chunks = pre_splitter.split_text(document_text)
    print(f"📄 Document pre-split into {len(pre_chunks)} larger chunks for processing.")

    # 2. Semantisches Splitting für jeden Vor-Chunk in einer Schleife
    final_chunks = []
    for i, pre_chunk in enumerate(pre_chunks):
        print(f"\n🧠 Processing pre-chunk {i + 1} of {len(pre_chunks)} with Gemini...")
        
        # Rufe die Gemini-Funktion für jeden einzelnen Vor-Chunk auf
        semantic_chunks_from_pre_chunk = split_text_with_gemini(pre_chunk)
        
        if semantic_chunks_from_pre_chunk:
            final_chunks.extend(semantic_chunks_from_pre_chunk)
            print(f"  > Added {len(semantic_chunks_from_pre_chunk)} semantic chunks.")
        else:
            print("  > No chunks returned for this pre-chunk.")
            # Optional: Füge den Vor-Chunk als Ganzes hinzu, wenn Gemini fehlschlägt
            # final_chunks.append(pre_chunk) 

    context_aware_chunks = final_chunks
    print(f"\n\n✅🏁 Total semantic chunks created: {len(context_aware_chunks)}")
else:
    print("⚠️ No document text to process.")

📖 Reading text and tables from 'simatic_S7_300.pdf'...
  > Found 15 table(s) on page 1
  > Found 1 table(s) on page 2
  > Found 0 table(s) on page 3
  > Found 0 table(s) on page 4
  > Found 1 table(s) on page 5
  > Found 1 table(s) on page 6
  > Found 0 table(s) on page 7
  > Found 0 table(s) on page 8
  > Found 0 table(s) on page 9
  > Found 0 table(s) on page 10
  > Found 0 table(s) on page 11
  > Found 0 table(s) on page 12
  > Found 0 table(s) on page 13
  > Found 0 table(s) on page 14
  > Found 0 table(s) on page 15
  > Found 0 table(s) on page 16
  > Found 0 table(s) on page 17
  > Found 0 table(s) on page 18
  > Found 0 table(s) on page 19
  > Found 1 table(s) on page 20
  > Found 1 table(s) on page 21
  > Found 4 table(s) on page 22
  > Found 1 table(s) on page 23
  > Found 1 table(s) on page 24
  > Found 2 table(s) on page 25
  > Found 1 table(s) on page 26
  > Found 1 table(s) on page 27
  > Found 1 table(s) on page 28
  > Found 2 table(s) on page 29
  > Found 0 table(s) on p

## Step 4: Create Vector Database

Now we'll use Google's high-quality embedding model to convert our semantic chunks into vectors and store them in a local ChromaDB database. This database will be saved to disk to avoid re-processing in the future.

In [8]:
# Step 4: Create or Load Vector Database with FAISS

from langchain_community.vectorstores import FAISS
from langchain_google_genai import GoogleGenerativeAIEmbeddings
import os

VECTOR_DB_PATH = "./pdf_faiss_db"  # Wir nennen den Ordner um, um Verwechslungen zu vermeiden
db = None

if GOOGLE_API_KEY and context_aware_chunks:
    # Lade das Google Embedding-Modell
    print("🧠 Loading Google's text-embedding-004 model...")
    embeddings = GoogleGenerativeAIEmbeddings(model="models/text-embedding-004", google_api_key=GOOGLE_API_KEY)
    
    # Prüfe, ob eine bestehende FAISS-Datenbank geladen werden kann
    if os.path.exists(VECTOR_DB_PATH):
        print(f"📂 Found existing FAISS database at '{VECTOR_DB_PATH}'. Loading...")
        try:
            db = FAISS.load_local(
                folder_path=VECTOR_DB_PATH,
                embeddings=embeddings,
                allow_dangerous_deserialization=True # Nötig für das Laden von FAISS-Indizes
            )
            print(f"✅ Vector database loaded successfully with {db.index.ntotal} chunks.")
        except Exception as e:
            print(f"❌ Failed to load existing database: {e}. A new one will be created.")
            db = None # Setze db zurück, damit eine neue DB erstellt wird
    
    # Erstelle eine neue Datenbank, wenn keine geladen werden konnte
    if db is None:
        print(f"🔗 Creating new FAISS vector database...")
        print("⏱️ This can take a moment, depending on the number of chunks...")
        
        db = FAISS.from_texts(
            texts=context_aware_chunks, 
            embedding=embeddings
        )
        
        print("💾 Saving new vector database for future sessions...")
        db.save_local(folder_path=VECTOR_DB_PATH)
        print(f"✅ Vector database created and saved to '{VECTOR_DB_PATH}' with {db.index.ntotal} chunks.")

else:
    print("⚠️ Skipping vector database creation: No chunks or Google API key available.")

🧠 Loading Google's text-embedding-004 model...
🔗 Creating new FAISS vector database...
⏱️ This can take a moment, depending on the number of chunks...
💾 Saving new vector database for future sessions...
✅ Vector database created and saved to './pdf_faiss_db' with 1138 chunks.


In [9]:
# Configure the retriever
retriever = None
if db:
    retriever = db.as_retriever(
        search_type="similarity",
        search_kwargs={'k': 4}  # Return top 4 most relevant chunks
    )
    print("✅ Retriever configured.")
else:
    print("❌ Could not configure retriever because the vector database is not available.")

✅ Retriever configured.


## Step 5: Load Local GGUF Model

This section loads your local GGUF model using `LlamaCpp`. Please adjust the `model_path` to point to your model file.

In [10]:
# --- IMPORTANT: SET THE PATH TO YOUR GGUF MODEL FILE HERE ---
# GGUF_MODEL_PATH = r"C:\Users\User\Path\To\Your\Model.gguf" # <<< WINDOWS EXAMPLE
GGUF_MODEL_PATH = "models/Qwen3-8B-GGUF/Qwen3-8B-Q6_K.gguf" # <<< LINUX/MAC EXAMPLE

llm = None
print(f"🤖 Attempting to load GGUF model from: {GGUF_MODEL_PATH}")

if os.path.exists(GGUF_MODEL_PATH):
    try:
        llm = LlamaCpp(
            model_path=GGUF_MODEL_PATH,
            temperature=0.2,
            max_tokens=512,
            top_p=0.9,
            n_ctx=4096,
            n_batch=512,
            verbose=False,
            n_gpu_layers=0  # CPU only
        )
        print("✅ GGUF model loaded successfully!")
    except Exception as e:
        print(f"❌ Failed to load model: {e}")
        llm = None
else:
    print(f"❌ Model file not found at the specified path. Please update GGUF_MODEL_PATH.")

🤖 Attempting to load GGUF model from: models/Qwen3-8B-GGUF/Qwen3-8B-Q6_K.gguf


llama_context: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized


✅ GGUF model loaded successfully!


## Step 6: Setup and Run the RAG Chain

Finally, we assemble the RAG chain and test it with a question related to your PDF's content.

In [None]:
# Create prompt template. Adjust for your model's format if needed (e.g., Qwen3, Llama3, Phi3).
# This template is generic but can be adapted.
prompt_template = """<|im_start|>system
You are a helpful assistant. Answer the user's question based only on the provided context from the document. If the answer is not in the context, clearly state that.
Context:
{context}<|im_end|>
<|im_start|>user
Question: {question}<|im_end|>
<|im_start|>assistant
"""
prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=prompt_template
)

# Create the complete RAG chain
rag_chain = None
if retriever and llm:
    rag_chain = (
        {"context": retriever, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )
    print("✅ RAG chain created and ready to use!")
else:
    print("❌ RAG chain could not be created. Check previous steps for errors.")

In [None]:
# Test the RAG chain with a question
if rag_chain:
    # --- IMPORTANT: ASK A QUESTION RELEVANT TO YOUR PDF'S CONTENT ---
    question = "What is the main topic discussed in the document?" # <<< CHANGE THIS QUESTION
    
    print(f"\n❓ Asking question: {question}")
    print("="*50)
    
    # Invoke the chain to get an answer
    answer = rag_chain.invoke(question)
    
    print("💬 Answer:")
    print(answer)
else:
    print("⚠️ Cannot ask question because the RAG chain was not set up correctly.")

In [None]:
# Another example question
if rag_chain:
    question = "Summarize the key findings from any tables in the document." # <<< CHANGE THIS QUESTION
    
    print(f"\n❓ Asking question: {question}")
    print("="*50)
    
    answer = rag_chain.invoke(question)
    
    print("💬 Answer:")
    print(answer)