# Web Ingestion Example

This notebook shows how to use the web ingestion utilities to fetch a web page, split it into embedded chunks, store those chunks in a vector store, and answer questions about the page with citations.

In [14]:
import os
from dotenv import load_dotenv

from openai import AsyncOpenAI
from pydantic import BaseModel

from utils.model_costs import ModelUsageAsync
from utils.openai_calls import call_openai_structured

# Web-specific imports
from utils.web_ingestion import ingest_web_url, WebChunk, WebDocument
from utils.vector_store import VectorStore, get_query_embedding

load_dotenv() # .env should be in the root folder (sibling of this notebook)


openai_client = AsyncOpenAI(
    api_key=os.getenv("OPENAI_PROJECT_KEY"),
)

In [15]:
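import subprocess
import sys

# Install the Playwright browser binaries with the same interpreter that runs this kernel
subprocess.run([sys.executable, "-m", "playwright", "install"], check=True)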

# URL to process (replace with any URL you want to analyze)
url = "https://patents.google.com/patent/US8812435B1/en?oq=US8812435B1"

# Create usage tracker
web_ingestion_usage = ModelUsageAsync()

# Process the web page: extract text, chunk, and embed
web_doc, chunks = await ingest_web_url(
    url=url,
    openai_client=openai_client,
    target_chunk_tokens=350,  # As specified in the tech spec
    chunk_overlap=0.3,        # 30% overlap as specified
    embedding_model="text-embedding-3-small",
    llm_usage=web_ingestion_usage
)

# Print some stats
print(f"Processed URL: {web_doc.url}")
print(f"Page title: {web_doc.title}")
print(f"Created {len(chunks)} chunks")
print(f"Embedding tokens used: {await web_ingestion_usage.get_tokens_used()}")
print(f"Embedding cost: ${await web_ingestion_usage.get_cost()}")

# Display first chunk as example
if chunks:
    first_chunk = chunks[0]
    print("\nSample chunk:")
    print(f"URL: {first_chunk.url}")
    if first_chunk.xpath:
        print(f"XPath: {first_chunk.xpath}")
    if first_chunk.css_selector:
        print(f"CSS Selector: {first_chunk.css_selector}")
    print(f"Characters: {first_chunk.char_start} to {first_chunk.char_end}")
    print(f"Tokens: {first_chunk.tokens}")
    print(f"Text excerpt: {first_chunk.text[:200]}...")

Processed URL: https://patents.google.com/patent/US8812435B1/en?oq=US8812435B1
Extracted 51 chunks
Processed URL: https://patents.google.com/patent/US8812435B1/en?oq=US8812435B1
Page title: US8812435B1 - Learning objects and facts from documents - Google Patents
Created 51 chunks
Embedding tokens used: 16095
Embedding cost: $0.0003219

Sample chunk:
URL: https://patents.google.com/patent/US8812435B1/en?oq=US8812435B1
Characters: 0 to 1682
Tokens: 327
Text excerpt: This application is related to U.S. Utility patent application Ser. No. 11/394,610, entitled “Determining Document Subject by Using Title and Anchor Text of Related Documents,” by Shubin Zhao, filed o...
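
The `target_chunk_tokens=350` and `chunk_overlap=0.3` settings above describe a sliding window over the page text. The sketch below illustrates that idea only; it is not the code inside `ingest_web_url`. The `chunk_tokens` helper is hypothetical, and whitespace-separated words stand in for real model tokens.

```python
def chunk_tokens(tokens, target_chunk_tokens=350, chunk_overlap=0.3):
    """Split a token list into windows of at most target_chunk_tokens items.

    Consecutive windows share roughly chunk_overlap of a window.
    Hypothetical helper for illustration; not part of utils.web_ingestion.
    """
    if not 0 <= chunk_overlap < 1:
        raise ValueError("chunk_overlap must be in [0, 1)")
    overlap = round(target_chunk_tokens * chunk_overlap)
    step = target_chunk_tokens - overlap  # how far the window advances each time
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + target_chunk_tokens])
        if start + target_chunk_tokens >= len(tokens):
            break  # the last window already reaches the end of the text
    return windows


# Assumption: plain words play the role of tokens in this demo.
demo_words = [f"w{i}" for i in range(1000)]
demo_chunks = chunk_tokens(demo_words)
print([len(c) for c in demo_chunks])
```

With these defaults each window starts 245 tokens after the previous one, so the last 105 tokens of a window reappear at the start of the next.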


In [16]:
# Create vector store and add chunks
vector_store = VectorStore(embedding_dim=1536)  # dimension for text-embedding-3-small
vector_store.add_chunks(chunks)

print(f"Added {len(chunks)} chunks to vector store")

Added 51 chunks to vector store
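
To make the role of the vector store concrete, here is a minimal in-memory sketch that ranks stored items by cosine similarity. `TinyVectorStore` and its methods are hypothetical names chosen for this illustration; the real `VectorStore` in `utils.vector_store` may be organized quite differently.

```python
import math


def _unit(vector):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vector]


class TinyVectorStore:
    """Hypothetical toy store; not the VectorStore used in this notebook."""

    def __init__(self, embedding_dim):
        self.embedding_dim = embedding_dim
        self._items = []  # list of (unit vector, payload) pairs

    def add(self, embedding, payload):
        if len(embedding) != self.embedding_dim:
            raise ValueError("embedding has the wrong dimension")
        self._items.append((_unit(embedding), payload))

    def search(self, query_embedding, k=3):
        """Return the k (score, payload) pairs most similar to the query."""
        query = _unit(query_embedding)
        scored = [
            (sum(a * b for a, b in zip(query, vec)), payload)
            for vec, payload in self._items
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


# Three-dimensional toy embeddings; text-embedding-3-small vectors have 1536.
store = TinyVectorStore(embedding_dim=3)
store.add([1.0, 0.0, 0.0], "chunk about patents")
store.add([0.0, 1.0, 0.0], "chunk about cooking")
store.add([0.9, 0.1, 0.0], "chunk about claims")
top = store.search([1.0, 0.0, 0.0], k=2)
print([payload for _, payload in top])
```

Checking the dimension on insert is the reason `embedding_dim` has to match the embedding model.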


In [17]:
# Define some test questions
test_questions = [
    "A method to develop a search engine rank for object-source pairs within a corpus of published documents, the method comprising: semantically identifying, by an evaluation module, objects and source values contained within the corpus of published documents, wherein each source value is a name of an organization, and wherein the objects and source values each include one or more words identified within a published document in the corpus of published documents tying, by the evaluation module, each instance of a first object throughout the corpus of published documents to a source value based on: identifying a first instance of the first object in a first published document of the corpus of published documents"
]

# Function to retrieve and display results
async def query_document(question):
    print(f"\nQuery: {question}")
    
    # Track usage
    query_usage = ModelUsageAsync()
    
    # Get embedding for query
    query_embedding = await get_query_embedding(
        query=question,
        openai_client=openai_client,
        embedding_model="text-embedding-3-small",
        llm_usage=query_usage
    )
    
    # Retrieve relevant chunks with MMR for diversity
    retrieved_chunks = vector_store.mmr_search(
        query_embedding=query_embedding,
        k=6,  # As specified in tech spec
        lambda_param=0.7  # Balance between relevance and diversity
    )
    
    print(f"Retrieved {len(retrieved_chunks)} relevant chunks")
    
    # Display retrieved chunks
    for i, chunk in enumerate(retrieved_chunks):
        print(f"\nChunk {i+1}:")
        # Display a preview of the text (first 100 characters)
        print(f"{chunk.text[:100]}...")
    
    print(f"\nEmbedding tokens used: {await query_usage.get_tokens_used()}")
    print(f"Embedding cost: ${await query_usage.get_cost()}")
    
    return retrieved_chunks

# Test with the first question
retrieved_chunks = await query_document(test_questions[0])


Query: A method to develop a search engine rank for object-source pairs within a corpus of published documents, the method comprising: semantically identifying, by an evaluation module, objects and source values contained within the corpus of published documents, wherein each source value is a name of an organization, and wherein the objects and source values each include one or more words identified within a published document in the corpus of published documents tying, by the evaluation module, each instance of a first object throughout the corpus of published documents to a source value based on: identifying a first instance of the first object in a first published document of the corpus of published documents
Retrieved 6 relevant chunks

Chunk 1:
., the Encyclopedia Britannica Online) when selecting 

 the source document.

Alternatively, the 

...

Chunk 2:
ribute-value pairs identified by applying the general contextual pattern may be over-inclusive. If t...

Chunk 3:
f a given 
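
The retrieval cell above relies on `mmr_search` with `lambda_param=0.7`. As a rough guide to what that parameter trades off, the sketch below implements the textbook maximal marginal relevance rule: each step picks the candidate that maximizes `lambda * relevance - (1 - lambda) * redundancy`. The `mmr_select` function and the toy vectors are assumptions for illustration; the project's `mmr_search` may differ in detail.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr_select(query, candidates, k=6, lambda_param=0.7):
    """Return indices of up to k candidates chosen greedily by MMR.

    Hypothetical helper; not the implementation behind VectorStore.mmr_search.
    """
    relevance = [cosine(query, c) for c in candidates]
    selected = []
    remaining = list(range(len(candidates)))

    def score(i):
        # Redundancy is the similarity to the closest already-selected item.
        redundancy = max(
            (cosine(candidates[i], candidates[j]) for j in selected),
            default=0.0,
        )
        return lambda_param * relevance[i] - (1 - lambda_param) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query_vec = [1.0, 0.0]
candidate_vecs = [
    [0.81, 0.59],  # most relevant to the query
    [0.80, 0.60],  # near-duplicate of candidate 0
    [0.70, -0.70], # less relevant, but points somewhere else
]
diverse = mmr_select(query_vec, candidate_vecs, k=2, lambda_param=0.7)
relevance_only = mmr_select(query_vec, candidate_vecs, k=2, lambda_param=1.0)
print(diverse, relevance_only)
```

With `lambda_param=1.0` the rule reduces to plain similarity ranking and returns the near-duplicate; lowering it to 0.7 swaps that duplicate for the more distinct candidate.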

In [18]:
# Create a prompt for question answering with citations
QA_WITH_CITATIONS_PROMPT = """
Task: Answer the user's question based ONLY on the provided context. 
Include verbatim quotes from the context to support your answer.
Format your answer with cited text in quotes and include the URL source for each citation.

User question: {user_question}

Context:
{context}

Your answer must:
1. Only contain information present in the context
2. Include at least 2 direct quotes from the context
3. Specify the URL source for each quote
4. Be concise and focused on the question
"""

async def answer_with_citations(question, retrieved_chunks):
    print(f"\nGenerating answer for: {question}")
    
    # Format context from chunks
    context_parts = []
    for i, chunk in enumerate(retrieved_chunks):
        context_parts.append(f"Source {i+1} - {chunk.url}:\n{chunk.text}\n")
    
    context = "\n".join(context_parts)
    
    # Create message history
    message_history = [
        {
            "role": "system",
            "content": "You are an expert assistant that answers questions based solely on provided context."
        },
        {
            "role": "user",
            "content": QA_WITH_CITATIONS_PROMPT.format(
                user_question=question,
                context=context
            )
        }
    ]
    
    # Track usage
    answer_usage = ModelUsageAsync()
    
    # Call LLM to generate answer
    model_response = await call_openai_structured(
        openai_client=openai_client,
        model="o4-mini",  # First call with o4-mini as specified
        messages=message_history,
        reasoning_effort="high",
        llm_usage=answer_usage
    )
    
    answer = model_response.choices[0].message.content
    
    print(f"\nAnswer:\n{answer}")
    print(f"\nTokens used: {await answer_usage.get_tokens_used()}")
    print(f"Answer cost: ${await answer_usage.get_cost()}")
    
    return answer

# Generate answer for the first question
answer = await answer_with_citations(test_questions[0], retrieved_chunks)


Generating answer for: A method to develop a search engine rank for object-source pairs within a corpus of published documents, the method comprising: semantically identifying, by an evaluation module, objects and source values contained within the corpus of published documents, wherein each source value is a name of an organization, and wherein the objects and source values each include one or more words identified within a published document in the corpus of published documents tying, by the evaluation module, each instance of a first object throughout the corpus of published documents to a source value based on: identifying a first instance of the first object in a first published document of the corpus of published documents

Answer:
The method uses an evaluation module to:

1. Semantically identify candidate source documents by keyword‐searching for each object name together with its attribute‐value pairs:  
   “Alternatively, the <…> searches for documents containing an object n
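
The formatting pass later in this notebook asks the model for a bare JSON object, yet a reply can still come back wrapped in a Markdown code fence. The sketch below shows one defensive way to parse such a reply before using it. `parse_json_reply` is a hypothetical helper, not part of `utils.openai_calls`, and the sample reply is invented for the demo.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker, built this way to keep the example readable


def parse_json_reply(reply):
    """Parse a model reply as JSON, tolerating an optional Markdown fence.

    Hypothetical helper for illustration only.
    """
    text = reply.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line, which may carry a tag such as "json".
        first_newline = text.find("\n")
        text = text[first_newline + 1:] if first_newline != -1 else ""
        text = text.rstrip()
        if text.endswith(FENCE):
            text = text[:-len(FENCE)]
    return json.loads(text)  # raises json.JSONDecodeError on malformed input


# Invented sample reply with the two fields the formatter prompt requests.
sample_reply = (
    FENCE + "json\n"
    + '{"answer": "example", "citations": [{"url": "https://example.com", "text": "quoted"}]}\n'
    + FENCE
)
parsed = parse_json_reply(sample_reply)
print(parsed["citations"][0]["url"])
```

Letting `json.loads` raise on bad input keeps truncated or malformed replies from passing silently into later steps.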

In [19]:
# Create a prompt for structured JSON output
JSON_FORMATTER_PROMPT = """
Format the following answer into a valid JSON structure with these fields:
1. "answer": The complete answer text
2. "citations": An array of citation objects, each with "url" and "text" fields

Original answer:
{answer}

Return ONLY the JSON object, nothing else.
"""

async def format_as_json(answer):
    print("\nFormatting answer as JSON...")
    
    # Create message history
    message_history = [
        {
            "role": "system",
            "content": "You are a helpful assistant that formats text as valid JSON."
        },
        {
            "role": "user",
            "content": JSON_FORMATTER_PROMPT.format(answer=answer)
        }
    ]
    
    # Track usage
    json_usage = ModelUsageAsync()
    
    # Call LLM to format as JSON
    model_response = await call_openai_structured(
        openai_client=openai_client,
        model="o4-mini",  # Using o4-mini for the formatting pass
        messages=message_history,
        reasoning_effort="medium",
        llm_usage=json_usage
    )
    
    json_response = model_response.choices[0].message.content
    
    print(f"\nJSON Output:\n{json_response}")
    print(f"\nTokens used: {await json_usage.get_tokens_used()}")
    print(f"JSON formatting cost: ${await json_usage.get_cost()}")
    
    return json_response

# Format the answer as JSON
json_output = await format_as_json(answer)


Formatting answer as JSON...

JSON Output:
```json
{
  "answer": "The method uses an evaluation module to:\n\n1. Semantically identify candidate source documents by keyword‐searching for each object name together with its attribute‐value pairs:\n   “Alternatively, the <…> searches for documents containing an object name of the source object and one or more of its facts (or attribute‐value pairs) in the plurality of documents.”\n   Source: https://patents.google.com/patent/US8812435B1/en?oq=US8812435B1\n\n2. Select the canonical source by checking the document title for the object name (and, if absent, repeating with another document):\n   “In one embodiment, the <…> determines whether the document title contains the source object name as a substring. A substring is a contiguous sequence of characters taken from a string. If the document title does not contain the source object name as a substring, the <…> repeats the above process and selects another document as the source document.”\

In [7]:
# Try a second question (runs only if test_questions contains more than one entry)
if len(test_questions) > 1:
    retrieved_chunks = await query_document(test_questions[1])
    answer = await answer_with_citations(test_questions[1], retrieved_chunks)
    json_output = await format_as_json(answer)


Query: What are the major applications of NLP?
Retrieved 6 relevant chunks

Chunk 1:
neural networks approach, using semantic networks[23] and word embeddings to capture semantic proper...

Chunk 2:
Field of linguistics and computer science

Natural language processing (NLP) is a subfield of comput...

Chunk 3:
h techniques[11][12] can achieve state-of-the-art results in many natural language tasks, e.g., in l...

Chunk 4:
tive AI".[57] Likewise, ideas of cognitive NLP are inherent to neural models multimodal NLP (althoug...

Chunk 5:
t the time.[4]

1970s: During the 1970s, many programmers began to write "conceptual ontologies", wh...

Chunk 6:
processing with the introduction of machine learning algorithms for language processing.  This was d...

Embedding tokens used: 9
Embedding cost: $1.8e-07

Generating answer for: What are the major applications of NLP?

Answer:
Here are some of the major real-world applications of NLP, drawn directly from the provided context:

1.   “Major ta