# PDF Ingestion Example

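This notebook demonstrates how to use the PDF ingestion utilities to extract, chunk, and embed a PDF file, store the chunks in a vector store, retrieve relevant chunks with MMR search, and answer questions with page-cited quotes.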

In [1]:
import os
from dotenv import load_dotenv

from openai import AsyncOpenAI
from pydantic import BaseModel

from utils.model_costs import ModelUsageAsync
from utils.openai_calls import call_openai_structured

In [4]:
# PDF ingestion and vector store utilities used in the cells below
from utils.pdf_ingestion import ingest_pdf, PDFChunk, PDFDocument
from utils.vector_store import VectorStore, get_query_embedding

In [2]:
load_dotenv() # .env should be in the root folder (sibling of this notebook)

openai_client = AsyncOpenAI(
    api_key=os.getenv("OPENAI_PROJECT_KEY"),
)

In [5]:
# Path to your PDF file
pdf_path = "basic_laws.pdf"

# Create usage tracker
pdf_ingestion_usage = ModelUsageAsync()

# Process the PDF: extract text, chunk, and embed
pdf_doc, chunks = await ingest_pdf(
    pdf_path=pdf_path,
    openai_client=openai_client,
    target_chunk_tokens=350,  # As specified in the tech spec
    chunk_overlap=0.3,        # 30% overlap as specified
    embedding_model="text-embedding-3-small",
    llm_usage=pdf_ingestion_usage
)

# Print some stats
print(f"Processed PDF: {pdf_doc.filename}")
print(f"Extracted {len(pdf_doc.page_texts)} pages")
print(f"Created {len(chunks)} chunks")
print(f"Embedding tokens used: {await pdf_ingestion_usage.get_tokens_used()}")
print(f"Embedding cost: ${await pdf_ingestion_usage.get_cost()}")

# Display first chunk as example
if chunks:
    first_chunk = chunks[0]
    print("\nSample chunk:")
    print(f"Page: {first_chunk.page_index + 1}")
    print(f"Characters: {first_chunk.char_start} to {first_chunk.char_end}")
    print(f"Tokens: {first_chunk.tokens}")
    print(f"Text excerpt: {first_chunk.text[:200]}...")

Processed PDF: basic_laws.pdf
Extracted 170 pages
Created 169 chunks
Embedding tokens used: 136620
Embedding cost: $0.0027324

Sample chunk:
Page: 1
Characters: 0 to 231
Tokens: 53
Text excerpt: 2016 edition
BASIC
LAWS
and AUTHORITIES of the NATIONAL ARCHIVES
and RECORDS ADMINISTR ATION
Office of General Counsel
National Archives and Records Administration
www.archives.gov
Additional material...
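
The `target_chunk_tokens=350` and `chunk_overlap=0.3` arguments suggest a sliding-window chunker. The internals of `ingest_pdf` are not shown in this notebook, so the following is only a minimal sketch of that idea. It assumes whitespace-separated words as a stand-in for model tokens, and `chunk_with_overlap` is a hypothetical helper, not part of `utils.pdf_ingestion`.

```python
def chunk_with_overlap(text, target_tokens=350, overlap=0.3):
    """Split text into windows of up to `target_tokens` words.

    Consecutive windows share roughly `overlap` of their words.
    Words are a crude proxy for tokens (assumption for this sketch).
    """
    if target_tokens <= 0:
        raise ValueError("target_tokens must be positive")
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be in [0, 1)")

    words = text.split()
    # Distance between the starts of two consecutive windows
    step = max(1, int(round(target_tokens * (1 - overlap))))

    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + target_tokens]))
        # Stop once a window has reached the end of the text
        if start + target_tokens >= len(words):
            break
    return chunks


# Toy input: 1000 distinct "words" named w0 .. w999
sample = " ".join(f"w{i}" for i in range(1000))
sample_chunks = chunk_with_overlap(sample)
```

With these defaults the window advances 245 words at a time, so each full window repeats the last 105 words (30% of 350) of the previous one. A real implementation would count tokens with the embedding model's tokenizer and would likely also track page and character offsets, as the `PDFChunk` fields printed above suggest.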


In [6]:
# Create vector store and add chunks
vector_store = VectorStore(embedding_dim=1536)  # dimension for text-embedding-3-small
vector_store.add_chunks(chunks)

print(f"Added {len(chunks)} chunks to vector store")

Added 169 chunks to vector store
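
`VectorStore` is a project utility whose implementation is not shown here. As a rough mental model only, an in-memory store can keep each embedding next to its payload and rank entries by cosine similarity. The sketch below assumes exactly that: `TinyVectorStore` and its methods are hypothetical, and the example uses 3-dimensional toy vectors rather than real 1536-dimensional embeddings.

```python
import math


class TinyVectorStore:
    """Hypothetical in-memory store ranking payloads by cosine similarity."""

    def __init__(self, embedding_dim):
        self.embedding_dim = embedding_dim
        self._items = []  # list of (unit-length vector, payload) pairs

    @staticmethod
    def _normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        if norm == 0:
            raise ValueError("zero vector cannot be normalized")
        return [x / norm for x in vec]

    def add(self, embedding, payload):
        if len(embedding) != self.embedding_dim:
            raise ValueError("embedding has the wrong dimension")
        self._items.append((self._normalize(embedding), payload))

    def search(self, query_embedding, k=3):
        """Return the k best (similarity, payload) pairs, best first."""
        q = self._normalize(query_embedding)
        scored = [
            (sum(a * b for a, b in zip(q, vec)), payload)
            for vec, payload in self._items
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


store = TinyVectorStore(embedding_dim=3)
store.add([1.0, 0.0, 0.0], "chunk about disposal")
store.add([0.0, 1.0, 0.0], "chunk about definitions")
store.add([0.9, 0.1, 0.0], "chunk about archivist review")
top_hits = store.search([1.0, 0.0, 0.0], k=2)
```

Storing unit-length vectors means the dot product in `search` is already the cosine similarity. A linear scan like this is adequate for a few hundred chunks; larger collections usually call for an approximate nearest-neighbor index.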


In [19]:
# Define some test questions
test_questions = [
    "How and when can the President dispose of Presidential records?",
    "What proivsions are teh Vice-Presidential records subject to?",
    # Add more relevant questions about your PDF content
]

# Function to retrieve and display results
async def query_document(question):
    print(f"\nQuery: {question}")
    
    # Track usage
    query_usage = ModelUsageAsync()
    
    # Get embedding for query
    query_embedding = await get_query_embedding(
        query=question,
        openai_client=openai_client,
        embedding_model="text-embedding-3-small",
        llm_usage=query_usage
    )
    
    # Retrieve relevant chunks with MMR for diversity
    retrieved_chunks = vector_store.mmr_search(
        query_embedding=query_embedding,
        k=6,  # As specified in tech spec
        lambda_param=0.7  # Balance between relevance and diversity
    )
    
    print(f"Retrieved {len(retrieved_chunks)} relevant chunks")
    
    # Display retrieved chunks
    for i, chunk in enumerate(retrieved_chunks):
        print(f"\nChunk {i+1} (Page {chunk.page_index + 1}):")
        # Display a preview of the text (first 100 characters)
        print(f"{chunk.text[:100]}...")
    
    print(f"\nEmbedding tokens used: {await query_usage.get_tokens_used()}")
    print(f"Embedding cost: ${await query_usage.get_cost()}")
    
    return retrieved_chunks

# Test with the first question
retrieved_chunks = await query_document(test_questions[0])


Query: How and when can the President dispose of Presidential records?
Retrieved 6 relevant chunks

Chunk 1 (Page 50):
(4) The term “Archivist” means the Archivist of the mittees at least 60 calendar days of continuous ...

Chunk 2 (Page 122):
EXECUTIVE ORDER 13489—
PRESIDENTIAL RECORDS
By the authority vested in me as President by the Sec. 2...

Chunk 3 (Page 51):
(3) The Archivist is authorized to dispose of such Pres- (ii) the expiration of the duration specifi...

Chunk 4 (Page 42):
(ii) any personnel with appropriate security clearances 44 U.S.C. § 2111 NOTE
of a Federal contracto...

Chunk 5 (Page 68):
the end of the periods specified, have sufficient admin- When the Archivist and the head of the agen...

Chunk 6 (Page 5):
FEDERAL REGISTER AND THE CODE OF FEDERAL REGULATIONS 11
§ 1501. Definitions 12
§ 1502. Custody and p...

Embedding tokens used: 11
Embedding cost: $2.2e-07
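
Maximal marginal relevance (MMR) selects results greedily, scoring each candidate as `lambda * sim(query, candidate) - (1 - lambda) * max sim(candidate, already selected)`. How `vector_store.mmr_search` implements this is not shown in the notebook; the sketch below is a generic version of the standard formulation, and `mmr_select` is a hypothetical name. It illustrates how a near-duplicate of the top hit can be passed over in favor of a less similar candidate.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr_select(query, candidates, k=6, lambda_param=0.7):
    """Return indices of up to k candidates chosen by greedy MMR."""
    remaining = list(range(len(candidates)))
    selected = []
    while remaining and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for idx in remaining:
            relevance = cosine(query, candidates[idx])
            # Similarity to the closest already-selected item (0 if none yet)
            redundancy = max(
                (cosine(candidates[idx], candidates[j]) for j in selected),
                default=0.0,
            )
            score = lambda_param * relevance - (1 - lambda_param) * redundancy
            if score > best_score:
                best_idx, best_score = idx, score
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected


# Toy vectors: candidates 0 and 1 are near-duplicates, candidate 2 differs
query = [1.0, 1.0, 0.0]
candidates = [[1.0, 0.25, 0.0], [1.0, 0.2, 0.0], [0.1, 1.0, 0.0]]
diverse = mmr_select(query, candidates, k=2, lambda_param=0.7)
relevance_only = mmr_select(query, candidates, k=2, lambda_param=1.0)
```

On this toy data `diverse` is `[0, 2]`: after picking candidate 0, its near-duplicate (candidate 1) is penalized enough that the less relevant but different candidate 2 wins. With `lambda_param=1.0` the redundancy term vanishes and `relevance_only` is `[0, 1]`.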


In [18]:
# Create a prompt for question answering with citations
QA_WITH_CITATIONS_PROMPT = """
Task: Answer the user's question based ONLY on the provided context. 
Include verbatim quotes from the context to support your answer.
Format your answer with cited text in quotes and include the page number in parentheses.

User question: {user_question}

Context:
{context}

Your answer must:
1. Only contain information present in the context
2. Include at least 2 direct quotes from the context
3. Specify the page number for each quote in parentheses
4. Be concise and focused on the question
"""

async def answer_with_citations(question, retrieved_chunks):
    print(f"\nGenerating answer for: {question}")
    
    # Format context from chunks
    context_parts = []
    for chunk in retrieved_chunks:
        context_parts.append(f"Page {chunk.page_index + 1}:\n{chunk.text}\n")
    
    context = "\n".join(context_parts)
    
    # Create message history
    message_history = [
        {
            "role": "system",
            "content": "You are an expert assistant that answers questions based solely on provided context."
        },
        {
            "role": "user",
            "content": QA_WITH_CITATIONS_PROMPT.format(
                user_question=question,
                context=context
            )
        }
    ]
    
    # Track usage
    answer_usage = ModelUsageAsync()
    
    # Call LLM to generate answer
    model_response = await call_openai_structured(
        openai_client=openai_client,
        model="o4-mini",  # First call with o4-mini as specified
        messages=message_history,
        reasoning_effort="high",
        llm_usage=answer_usage
    )
    
    answer = model_response.choices[0].message.content
    
    print(f"\nAnswer:\n{answer}")
    print(f"\nTokens used: {await answer_usage.get_tokens_used()}")
    print(f"Answer cost: ${await answer_usage.get_cost()}")
    
    return answer

# Generate answer for the first question
answer = await answer_with_citations(test_questions[0], retrieved_chunks)


Generating answer for: How and when can the President dispose of Presidential records?

Answer:
The Presidential Records Act authorizes the President, during his term of office, to discard his own records that “no longer have administrative, historical, informational, or evidentiary value” so long as he first secures the Archivist’s written views and a statement that no action under § 2203(e) will be taken:  
“During the President’s term of office, the President may dispose of those Presidential records … if (1) the President obtains the views, in writing, of the Archivist concerning the proposed disposal of such Presidential records; and (2) the Archivist states that the Archivist does not intend to take any action under subsection (e) of this section.” (page 41)  

If the Archivist does object (i.e., notifies the President of an intent to take action under § 2203(e)), the President may still dispose of the records only after providing copies of the disposal schedule to the appropria
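
Because the prompt asks for verbatim quotes with page numbers, an answer can be spot-checked against the retrieved context. The sketch below is one possible check and is not part of the notebook's utilities: `find_unsupported_quotes` is a hypothetical helper that extracts text between curly or straight double quotes, splits each quote on ellipses, and reports fragments that do not appear in any chunk text after whitespace normalization.

```python
import re


def _normalize(text):
    """Collapse all runs of whitespace (including newlines) to single spaces."""
    return " ".join(text.split())


def find_unsupported_quotes(answer, chunk_texts):
    """Return quoted fragments from `answer` not found in any chunk text."""
    haystack = [_normalize(t) for t in chunk_texts]
    # Text between an opening and closing double quote (curly or straight)
    quotes = re.findall(r'[“"]([^“”"]+)[”"]', answer)
    unsupported = []
    for quote in quotes:
        # An ellipsis marks omitted text, so check each side separately
        for fragment in re.split(r"…|\.\.\.", quote):
            fragment = _normalize(fragment)
            if fragment and not any(fragment in h for h in haystack):
                unsupported.append(fragment)
    return unsupported


# Toy data, unrelated to the PDF used in this notebook
chunk_texts = ["The quick brown fox\njumps over the lazy dog near the river bank."]
toy_answer = 'It says “quick brown fox jumps … the lazy dog” (page 1) and “the fox sleeps” (page 1).'
unsupported = find_unsupported_quotes(toy_answer, chunk_texts)
```

Here `unsupported` contains only `"the fox sleeps"`, the one quote with no match in the context. In this notebook the call would look like `find_unsupported_quotes(answer, [c.text for c in retrieved_chunks])`. Line-break hyphenation in extracted PDF text (for example `Pres-` at the end of a line) can cause false negatives, so a failed match is a prompt for manual review rather than proof of a fabricated quote.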

In [20]:
# Run through all test questions
for question in test_questions[1:]:  # Skip the first one we already did
    retrieved_chunks = await query_document(question)
    answer = await answer_with_citations(question, retrieved_chunks)
    print("\n" + "-"*80 + "\n")


Query: What proivsions are teh Vice-Presidential records subject to?
Retrieved 6 relevant chunks

Chunk 1 (Page 52):
(2) Nothing in this Act shall be construed to confirm, shall be available to such former President o...

Chunk 2 (Page 5):
FEDERAL REGISTER AND THE CODE OF FEDERAL REGULATIONS 11
§ 1501. Definitions 12
§ 1502. Custody and p...

Chunk 3 (Page 42):
(ii) any personnel with appropriate security clearances 44 U.S.C. § 2111 NOTE
of a Federal contracto...

Chunk 4 (Page 122):
EXECUTIVE ORDER 13489—
PRESIDENTIAL RECORDS
By the authority vested in me as President by the Sec. 2...

Chunk 5 (Page 49):
§ 2120. ONLINE ACCESS OF FOUNDING (3) Thomas Jefferson;
FATHERS DOCUMENTS (4) Benjamin Franklin;
The...

Chunk 6 (Page 51):
(3) The Archivist is authorized to dispose of such Pres- (ii) the expiration of the duration specifi...

Embedding tokens used: 15
Embedding cost: $3e-07

Generating answer for: What proivsions are teh Vice-Presidential records subject to?

Answer:
“Vice-Preside