# PDF Highlighting Example

This notebook demonstrates how to:
1. Process a PDF document (extract, chunk, and embed text)
2. Answer a question using the processed PDF
3. Highlight the cited text in the original PDF

In [1]:
import json
import os
import re

from dotenv import load_dotenv

from openai import AsyncOpenAI
from utils.model_costs import ModelUsageAsync
from utils.pdf_ingestion import ingest_pdf
from utils.vector_store import VectorStore, get_query_embedding
from utils.openai_calls import call_openai_structured
from utils.pdf_highlighter import highlight_pdf_with_citations

# Load environment variables
load_dotenv()

# Initialize OpenAI client
openai_client = AsyncOpenAI(
    api_key=os.getenv("OPENAI_PROJECT_KEY"),
)

## 1. Process the PDF Document

First, we'll process the PDF to extract text, chunk it, and generate embeddings.

In [2]:
# Set path to your PDF file
pdf_path = "patent.pdf"

# Fall back to another PDF in the working directory if the default is missing
if not os.path.exists(pdf_path):
    print(f"PDF file not found: {pdf_path}")
    print(f"Current directory: {os.getcwd()}")

    # List PDF files in the current directory (case-insensitive extension match)
    pdf_files = sorted(f for f in os.listdir() if f.lower().endswith(".pdf"))
    if pdf_files:
        print("Available PDF files:")
        for pdf in pdf_files:
            print(f"- {pdf}")

        # Use the first available PDF
        pdf_path = pdf_files[0]
        print(f"Using: {pdf_path}")
    else:
        print("No PDF files found in current directory.")

In [3]:
# Initialize usage tracker
pdf_usage = ModelUsageAsync()

# Process the PDF: extract text, chunk, and embed
pdf_doc, chunks = await ingest_pdf(
    pdf_path=pdf_path,
    openai_client=openai_client,
    target_chunk_tokens=350,  # ~350 tokens per chunk as specified in tech spec
    chunk_overlap=0.3,        # 30% overlap as specified
    embedding_model="text-embedding-3-small",
    llm_usage=pdf_usage
)

# Print stats
print(f"Processed PDF: {pdf_doc.filename}")
print(f"Total pages: {len(pdf_doc.page_texts)}")
print(f"Total chunks: {len(chunks)}")
print(f"Embedding tokens used: {await pdf_usage.get_tokens_used()}")
print(f"Embedding cost: ${await pdf_usage.get_cost()}")

Processed PDF: patent.pdf
Total pages: 25
Total chunks: 142
Embedding tokens used: 41750
Embedding cost: $0.000835
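
The internals of `ingest_pdf` are not shown in this notebook. As a rough illustration of what `target_chunk_tokens=350` and `chunk_overlap=0.3` could mean, here is a minimal sliding-window sketch. It treats whitespace-separated words as a stand-in for model tokens, and `chunk_words` is a hypothetical helper, not part of `utils.pdf_ingestion`.

```python
def chunk_words(text: str, target_tokens: int = 350, overlap: float = 0.3) -> list[str]:
    """Split text into windows of about `target_tokens` words with fractional overlap.

    Assumption: words approximate tokens; a real pipeline would use a tokenizer.
    """
    if target_tokens <= 0:
        raise ValueError("target_tokens must be positive")
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be in [0, 1)")

    words = text.split()
    # Each new window starts this many words after the previous one.
    step = max(1, round(target_tokens * (1 - overlap)))

    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + target_tokens]))
        if start + target_tokens >= len(words):
            break  # this window already reaches the end of the text
    return chunks


# Synthetic input: 1000 distinct "words" w0 ... w999
sample = " ".join(f"w{i}" for i in range(1000))
pieces = chunk_words(sample, target_tokens=350, overlap=0.3)
print(len(pieces), [len(p.split()) for p in pieces])
```

With a 30% overlap, consecutive windows share roughly the last 105 words of the previous chunk, so a sentence that straddles a boundary still appears whole in at least one chunk.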


## 2. Create Vector Store and Search

Now we'll create a vector store with the document chunks for semantic search.

In [4]:
# Create vector store
vector_store = VectorStore(embedding_dim=1536)  # dimension for text-embedding-3-small
vector_store.add_chunks(chunks)
print(f"Added {len(chunks)} chunks to vector store")

Added 142 chunks to vector store
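
`VectorStore.mmr_search` is called in the next section with `k=6` and `lambda_param=0.7`, but its implementation lives in `utils.vector_store` and is not shown here. The following is a minimal pure-Python sketch of Maximal Marginal Relevance selection under the common formulation, where each step picks the candidate that maximizes `lambda * sim(query, d) - (1 - lambda) * max sim(d, selected)`. The names `cosine` and `mmr_select` are illustrative assumptions, and the actual method may differ.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query, docs, k=6, lambda_param=0.7):
    """Return indices of up to k docs chosen by Maximal Marginal Relevance."""
    relevance = [cosine(query, d) for d in docs]
    selected: list[int] = []
    remaining = list(range(len(docs)))

    while remaining and len(selected) < k:
        def score(i: int) -> float:
            # Penalize similarity to anything already selected.
            redundancy = max((cosine(docs[i], docs[j]) for j in selected), default=0.0)
            return lambda_param * relevance[i] - (1 - lambda_param) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query_vec = [1.0, 0.8, 0.0]
doc_vecs = [
    [1.0, 0.0, 0.0],  # doc 0: most relevant
    [1.0, 0.0, 0.0],  # doc 1: exact duplicate of doc 0
    [0.0, 1.0, 0.0],  # doc 2: less relevant but unrelated to doc 0
]
print(mmr_select(query_vec, doc_vecs, k=2, lambda_param=0.7))
print(mmr_select(query_vec, doc_vecs, k=2, lambda_param=1.0))
```

In this toy example, `lambda_param=0.7` skips the duplicate in favor of the less relevant but novel vector, while `lambda_param=1.0` reduces to plain top-k by relevance and returns both copies.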


## 3. Answer Questions with Citations

Now we'll create a function to answer questions with verbatim citations from the document.

In [5]:
async def answer_question(question: str) -> dict:
    """Answer a question about the PDF and return a dict with 'answer' and 'citations'."""
    print(f"Question: {question}")
    
    # Get embedding for query
    query_usage = ModelUsageAsync()
    query_embedding = await get_query_embedding(
        query=question,
        openai_client=openai_client,
        embedding_model="text-embedding-3-small",
        llm_usage=query_usage
    )
    
    # Retrieve relevant chunks with Maximal Marginal Relevance (MMR) for diversity
    retrieved_chunks = vector_store.mmr_search(
        query_embedding=query_embedding,
        k=6,  # Get top-6 chunks as specified in tech spec
        lambda_param=0.7  # Balance between relevance and diversity
    )
    
    print(f"Retrieved {len(retrieved_chunks)} relevant chunks")
    
    # Create context from chunks
    context_parts = []
    for chunk in retrieved_chunks:
        context_parts.append(f"Page {chunk.page_index + 1}:\n{chunk.text}\n")
    
    context = "\n".join(context_parts)
    
    # Create QA prompt
    qa_prompt = """
    Answer the question based ONLY on the provided context.
    Include verbatim quotes from the context to support your answer.
    
    Question: {question}
    
    Context:
    {context}
    
    Format your response as a JSON object with these fields:
    1. "answer": Your detailed answer to the question
    2. "citations": A list of citation objects, each with:
       - "page": The page number (integer)
       - "text": The exact quote from that page (string)
    
    Example format:
    {{"answer": "Your answer here...", "citations": [{{"page": 1, "text": "Exact quote from page 1"}}, {{"page": 2, "text": "Another quote from page 2"}}]}}
    """
    
    # Create message history
    message_history = [
        {"role": "system", "content": "You are an expert assistant that answers questions based solely on provided context."},
        {"role": "user", "content": qa_prompt.format(question=question, context=context)}
    ]
    
    # Get answer with o4-mini as specified in tech spec
    answer_usage = ModelUsageAsync()
    response = await call_openai_structured(
        openai_client=openai_client,
        model="o4-mini",
        messages=message_history,
        reasoning_effort="high",
        llm_usage=answer_usage
    )
    
    content = response.choices[0].message.content
    
    # Parse JSON response
    try:
        # Greedy match from the first "{" to the last "}" in the response
        json_match = re.search(r'\{.*\}', content, re.DOTALL)
        if json_match:
            result = json.loads(json_match.group(0))
        else:
            # Fall back to the raw text if no JSON object is present
            result = {"answer": content, "citations": []}
    except json.JSONDecodeError:
        print("Failed to parse JSON response. Using raw content.")
        result = {"answer": content, "citations": []}

    # Guarantee the keys that later cells index into
    result.setdefault("answer", content)
    result.setdefault("citations", [])
    
    # Calculate total usage
    total_tokens = await query_usage.get_tokens_used() + await answer_usage.get_tokens_used()
    total_cost = await query_usage.get_cost() + await answer_usage.get_cost()
    
    print(f"Total tokens used: {total_tokens}")
    print(f"Total cost: ${total_cost}")
    
    return result
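
The greedy pattern `\{.*\}` in `answer_question` spans from the first `{` to the last `}` in the reply, so parsing fails if the model appends prose that contains braces after the JSON. One possible alternative, shown here only as a sketch, scans for the first position where the standard library's `json.JSONDecoder.raw_decode` succeeds on an object. `extract_json_object` is a hypothetical helper, not part of the notebook's `utils` package.

```python
import json
from typing import Optional


def extract_json_object(text: str) -> Optional[dict]:
    """Return the first JSON object embedded in text, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores trailing text.
            value, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            value = None
        if isinstance(value, dict):
            return value
        start = text.find("{", start + 1)
    return None


# Made-up reply with trailing prose that contains braces
reply = (
    'Here you go: {"answer": "See page 2.", '
    '"citations": [{"page": 2, "text": "a {quoted} span"}]} (end {note})'
)
parsed = extract_json_object(reply)
print(parsed["citations"][0]["page"])
```

Because decoding stops at the end of the first complete object, trailing text such as `(end {note})` no longer breaks parsing.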

## 4. Ask a Question

Let's ask a question about the PDF and get an answer with citations.

In [6]:
# Sample question about the document
question = "A method to develop a search engine rank for object-source pairs within a corpus of published documents, the method comprising: semantically identifying, by an evaluation module, objects and source values contained within the corpus of published documents, wherein each source value is a name of an organization, and wherein the objects and source values each include one or more words identified within a published document in the corpus of published documents tying, by the evaluation module, each instance of a first object throughout the corpus of published documents to a source value based on: identifying a first instance of the first object in a first published document of the corpus of published documents"

# Get answer with citations
result = await answer_question(question)

# Display the answer
print("\nAnswer:")
print(result["answer"])

# Display citations
print("\nCitations:")
for i, citation in enumerate(result["citations"]):
    print(f"Citation {i+1} - Page {citation['page']}:")
    print(f'"{citation["text"]}"')
    print()

Question: A method to develop a search engine rank for object-source pairs within a corpus of published documents, the method comprising: semantically identifying, by an evaluation module, objects and source values contained within the corpus of published documents, wherein each source value is a name of an organization, and wherein the objects and source values each include one or more words identified within a published document in the corpus of published documents tying, by the evaluation module, each instance of a first object throughout the corpus of published documents to a source value based on: identifying a first instance of the first object in a first published document of the corpus of published documents
Retrieved 6 relevant chunks
Total tokens used: 3617
Total cost: $0.008659479999999999

Answer:
The provided context does not disclose any method for “develop[ing] a search engine rank for object-source pairs” nor any evaluation module that semantically identifies objects an

## 5. Highlight Citations in the PDF

Now we'll use our PDF highlighter to display the document with highlighted citations.

In [7]:
# highlight_pdf_with_citations was already imported in the first cell

print(f"Opening PDF: {pdf_path}")
print("The PDF viewer will open in a new window.")
print("- The answer is displayed in the top-right panel")
print("- Citations are listed below the answer")
print("- Click on a citation to navigate to that page")
print("- Citations are highlighted in bright yellow in the document")

# Open the PDF viewer with the answer panel and highlighted citations
highlight_pdf_with_citations(pdf_path, result["citations"], result["answer"])

Opening PDF: patent.pdf
The PDF viewer will open in a new window.
- The answer is displayed in the top-right panel
- Citations are listed below the answer
- Click on a citation to navigate to that page
- Citations are highlighted in bright yellow in the document
Found 0 citations for page 1
Page 1 has 3170 characters of text
Found 1 citations for page 24
Citation page numbers: [24]
Page 24 has 6494 characters of text
Looking for citation: selecting a source object from a plurality of obje...
Successfully highlighted using _highlight_with_exact_match
Found 1 citations for page 19
Citation page numbers: [19]
Page 19 has 7723 characters of text
Looking for citation: The importer 108 identifies 330 a title pattern an...
Successfully highlighted using _highlight_with_simplified_match
Found 1 citations for page 24
Citation page numbers: [24]
Page 24 has 6494 characters of text
Looking for citation: selecting a source object from a plurality of obje...
Successfully highlighted using _highligh