# Improved Search Across Multiple Documents with Highlights

This notebook demonstrates a streamlined approach to multi-document question-answering using Highlights:

1. **Scalability**: Processes collections of documents up to 2M tokens without context window limitations
2. **Precision**: Uses Highlights' contextual awareness to identify relevant passages across document sections
3. **Efficiency**: Optimizes both computational resources and token usage by focusing only on relevant sections

The implementation follows these steps:
1. Prepare and chunk multiple documents with appropriate metadata
2. Use the Highlights API to retrieve contextually relevant chunks across all documents
3. Send only the most relevant chunks to OpenAI for response generation

## Setup

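Install the required libraries and import the modules used throughout the notebook. `HighlightsClient` comes from a `base_client` module that the `pip` command below does not install, so it must already be importable from your environment.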

In [None]:
!pip install openai python-dotenv datasets langchain_text_splitters

import os
import openai
from dotenv import load_dotenv
from typing import List, Dict
from base_client import HighlightsClient

## Loading Environment Variables

Create a .env file with your API keys:
```
HIGHLIGHTS_API_KEY=your-highlights-api-key
OPENAI_API_KEY=your-openai-api-key
```

In [2]:
# Load environment variables
load_dotenv()

# Initialize clients
highlights_client = HighlightsClient(api_key=os.getenv('HIGHLIGHTS_API_KEY'))
openai.api_key = os.getenv('OPENAI_API_KEY')
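
`os.getenv` returns `None` when a variable is missing, so a forgotten key would only surface later as an authentication error. A minimal sketch of a fail-fast check follows; `require_env` is a hypothetical helper name, not part of the original notebook or of either client library.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Missing environment variable: {name}. Check your .env file."
        )
    return value


# Example usage (assumes load_dotenv() has already been called):
# highlights_key = require_env("HIGHLIGHTS_API_KEY")
# openai_key = require_env("OPENAI_API_KEY")
```

Calling this before constructing the clients makes a missing key fail at setup time, with the variable name in the message.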

## Document Collection Processing

Multi-document search introduces new challenges around chunking and metadata preservation. The approach below addresses them by keeping each chunk tied to its source document while still allowing context-aware retrieval across chunks.

In [3]:
# Load document collection
from datasets import load_dataset
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Extract textbooks from specific authors as an example collection
ds = load_dataset("princeton-nlp/TextbookChapters")
documents = {
    item['path']: item['chapter']
    for item in ds['train']
    if "Suza_and_Lamkey" in item['path']
}

# Print summary of loaded documents
print(f"Loaded {len(documents)} chapters from textbooks by Suza and Lamkey")



Loaded 26 chapters from textbooks by Suza and Lamkey


## Document Chunking with Metadata Preservation

The approach below uses recursive character splitting and wraps each chunk in an XML-like template that preserves its document metadata. Chunk sizes between 1,000 and 10,000 characters are the intended range; this example uses 8,192.

In [4]:
# Define an xml-style template for each chunk with text and metadata
text_chunk_template = """
<document>
    <metadata>
        <name>{document_name}</name>
        <chunk_id>{document_chunk_id}</chunk_id>
    </metadata>
    <content>{chapter_text}</content>
</document>
"""

# Use langchain's recursive text splitter as an example
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=8192,  # Balance context depth and retrieval granularity
    chunk_overlap=0,  # No overlap for this example, though 10-20% can improve context coherence
    separators=["\n\n", "\n", ". "]
)

text_chunks = []
for path, chapter in documents.items():
    chunks = text_splitter.split_text(chapter)

    for i, chunk in enumerate(chunks):
        document_text_chunk = text_chunk_template.format(
            document_name=path,
            chapter_text=chunk,
            document_chunk_id=i
        )
        text_chunks.append(document_text_chunk)

print(f"Split {len(documents)} chapters into {len(text_chunks)} chunks")

Split 26 chapters into 118 chunks
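
The template above inserts chapter text verbatim, so characters such as `<` or `&` inside a chapter would make a chunk ill-formed as XML. That is harmless if chunks are only ever treated as plain strings, but escaping helps if you want to parse the metadata back out later. A minimal sketch using only the standard library follows; `build_chunk` and `parse_chunk` are hypothetical helper names, and the template mirrors the one defined above.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

CHUNK_TEMPLATE = """
<document>
    <metadata>
        <name>{document_name}</name>
        <chunk_id>{document_chunk_id}</chunk_id>
    </metadata>
    <content>{chapter_text}</content>
</document>
"""


def build_chunk(name: str, chunk_id: int, text: str) -> str:
    """Fill the template, escaping &, < and > so the result is well-formed XML."""
    return CHUNK_TEMPLATE.format(
        document_name=escape(name),
        document_chunk_id=chunk_id,
        chapter_text=escape(text),
    )


def parse_chunk(chunk: str) -> dict:
    """Recover the document name, chunk id, and text from a formatted chunk."""
    root = ET.fromstring(chunk.strip())
    return {
        "name": root.findtext("metadata/name"),
        "chunk_id": int(root.findtext("metadata/chunk_id")),
        "text": root.findtext("content"),
    }


# Illustrative input only; the path and sentence are made up.
example = build_chunk("book/ch1.md", 0, "If a < b & b < c, then a < c.")
print(parse_chunk(example))
```

Whether escaped text affects retrieval quality is not something this sketch addresses; it only keeps the chunk parseable.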


## Cross-Document Retrieval with Highlights

Unlike traditional vector search, which scores each chunk in isolation, Highlights analyzes each segment within its broader context. This is intended to improve retrieval quality across diverse document collections.

In [5]:
query = "What are the different ways in which mutations can be classified?"

# Search for relevant chunks across all documents
highlights_response = highlights_client.search(
    query=query,
    chunk_txts=text_chunks,
    top_n=5,
)

relevant_passages = [result['chunk_txt'] for result in highlights_response['results']]
print(f"Retrieved {len(relevant_passages)} relevant passages from across {len(documents)} documents")

Retrieved 5 relevant passages from across 26 documents
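
Because each chunk carries its metadata, you can check which documents the retrieved passages came from. A minimal sketch follows, assuming the passages use the `<name>` and `<chunk_id>` tags from the template defined earlier; `passages_by_document` is a hypothetical helper and the sample strings are made up for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, List

NAME_RE = re.compile(r"<name>(.*?)</name>", re.DOTALL)
CHUNK_ID_RE = re.compile(r"<chunk_id>(\d+)</chunk_id>")


def passages_by_document(passages: List[str]) -> Dict[str, List[int]]:
    """Map each document name to the sorted chunk ids retrieved from it."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for passage in passages:
        name = NAME_RE.search(passage)
        chunk_id = CHUNK_ID_RE.search(passage)
        if name and chunk_id:  # skip passages that do not follow the template
            grouped[name.group(1)].append(int(chunk_id.group(1)))
    return {doc: sorted(ids) for doc, ids in grouped.items()}


# Illustrative input; in the notebook you would pass relevant_passages instead.
sample = [
    "<document><metadata><name>a.md</name><chunk_id>2</chunk_id></metadata><content>...</content></document>",
    "<document><metadata><name>a.md</name><chunk_id>0</chunk_id></metadata><content>...</content></document>",
    "<document><metadata><name>b.md</name><chunk_id>1</chunk_id></metadata><content>...</content></document>",
]
print(passages_by_document(sample))
```

A regular expression is used here instead of an XML parser because the chunks built in this notebook are not escaped and may not be well-formed XML.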


## Response Generation with Contextual Awareness

By forwarding only the most relevant chunks to a frontier model, we achieve three key benefits:
1. Extended document coverage beyond single-document context windows
2. Reduced token consumption with corresponding cost savings
3. Higher quality responses by eliminating cross-document noise and irrelevant content
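
The token savings in the second point can be estimated roughly before calling the model. The sketch below uses a heuristic of about four characters per token, which is an assumption for English text and not an exact tokenizer count; `estimate_tokens` and `context_reduction` are hypothetical helper names.

```python
from typing import List

CHARS_PER_TOKEN = 4  # rough heuristic for English text, not an exact tokenizer


def estimate_tokens(texts: List[str]) -> int:
    """Approximate the token count of a list of strings."""
    return sum(len(t) for t in texts) // CHARS_PER_TOKEN


def context_reduction(all_chunks: List[str], selected: List[str]) -> float:
    """Fraction of estimated tokens avoided by sending only the selected chunks."""
    total = estimate_tokens(all_chunks)
    if total == 0:
        return 0.0
    return 1.0 - estimate_tokens(selected) / total


# In the notebook: context_reduction(text_chunks, relevant_passages)
```

For exact figures, a tokenizer matching the target model would be needed; this estimate is only meant to show the order of magnitude.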

In [6]:
# Generate response using OpenAI with the consolidated context
context = '\n\n'.join(relevant_passages)
combined_prompt = f"""
Context information is below.
----------------
{context}
----------------
Using the above context, please answer the following question: {query}
"""

response = openai.chat.completions.create(
    model="gpt-4o-mini",  # or another appropriate model
    messages=[
        {"role": "system", "content": "You are a helpful assistant that answers questions based on provided context."},
        {"role": "user", "content": combined_prompt}
    ],
    temperature=0.7
)

print("\nGenerated Response:")
print(response.choices[0].message.content)


Generated Response:
Mutations can be classified in several ways, including:

1. **Causal Agent**: This classification distinguishes between spontaneous mutations, which occur naturally without intentional exposure to a mutagen, and induced mutations, which are caused by mutagens such as chemicals or radiation.

2. **Rate or Frequency of Occurrence**:
   - **Rare Mutations**: These occur infrequently in populations and are usually recessive, often hidden in heterozygotes.
   - **Recurrent Mutations**: These occur repeatedly and can influence gene frequency in populations.

3. **Kind of Tissue Involved and Inheritance Type**:
   - **Somatic Mutations**: Occur in somatic tissue and are not passed to offspring.
   - **Germinal (or Germ-Line) Mutations**: Occur in reproductive cells and can be inherited by future generations.

4. **Impact on Fitness or Function**:
   - **Deleterious Mutations**: Harmful mutations that decrease the fitness of an individual.
   - **Advantageous Mutations**: 