# Multi-Document Search and Question Answering with Highlights and OpenAI Integration

This notebook demonstrates how to:
1. Prepare documents for search
2. Use the Highlights API to search for relevant chunks
3. Send the most relevant chunks to OpenAI for text generation

## Setup

First, let's install and import the required libraries.

In [None]:
!pip install openai python-dotenv datasets langchain_text_splitters

import os
import openai
from dotenv import load_dotenv
from typing import List, Dict
from base_client import HighlightsClient


## Loading Environment Variables

Create a .env file with your API keys:
```
HIGHLIGHTS_API_KEY=your-highlights-api-key
OPENAI_API_KEY=your-openai-api-key
```

In [6]:
# Load environment variables
load_dotenv()

# Initialize clients
highlights_client = HighlightsClient(api_key=os.getenv('HIGHLIGHTS_API_KEY'))
openai.api_key = os.getenv('OPENAI_API_KEY')
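
`os.getenv` returns `None` when a variable is missing, and that may only surface later as an authentication error. As a minimal sketch, a small helper can fail fast instead. The name `require_env` is hypothetical and not part of either client library.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; check your .env file"
        )
    return value


# Example usage (after load_dotenv() has run):
# highlights_api_key = require_env("HIGHLIGHTS_API_KEY")
# openai_api_key = require_env("OPENAI_API_KEY")
```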

## Preparing Textbooks for Search

Next, let's download some horticulture textbook chapters from the `princeton-nlp/TextbookChapters` dataset on Hugging Face.

In [None]:
# Import the dataset loader and the text splitter
from datasets import load_dataset
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Extract textbooks from specific authors
ds = load_dataset("princeton-nlp/TextbookChapters")
documents = {
    item['path']: item['chapter']
    for item in ds['train']
    if "Suza_and_Lamkey" in item['path']
}

# Print summary of loaded documents
print(f"Loaded {len(documents)} chapters from textbooks by Suza and Lamkey")
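
If no path contains the author tag, the filter above silently yields an empty dictionary and every later step runs on nothing. The sketch below shows the same filter as a function that raises in that case. The helper name `select_chapters` and the sample records are made up for illustration and only mimic the `path` and `chapter` fields used above.

```python
from typing import Dict, Iterable, Mapping


def select_chapters(
    records: Iterable[Mapping[str, str]], author_tag: str
) -> Dict[str, str]:
    """Map path -> chapter text for records whose path contains author_tag."""
    selected = {
        record["path"]: record["chapter"]
        for record in records
        if author_tag in record["path"]
    }
    if not selected:
        raise ValueError(f"No chapters matched {author_tag!r}")
    return selected


# Made-up records standing in for ds['train']; real paths will differ.
sample_records = [
    {"path": "books/Suza_and_Lamkey/ch1.txt", "chapter": "Mutations are ..."},
    {"path": "books/Other_Author/ch1.txt", "chapter": "Unrelated text"},
]
sample_documents = select_chapters(sample_records, "Suza_and_Lamkey")
print(f"Selected {len(sample_documents)} of {len(sample_records)} records")
```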


Now, let's split the chapters into manageable chunks with helpful metadata. It's recommended to keep chunks between 1000 and 10000 characters.

In [3]:
# Define an xml-style template for each chunk with text and metadata
text_chunk_template = """
<document>
    <metadata>
        <name>{document_name}</name>
        <chunk_id>{document_chunk_id}</chunk_id>
    </metadata>
    <content>{chapter_text}</content>
</document>
"""

# Use langchain's recursive text splitter to split the chapters into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=8192, chunk_overlap=0, separators=["\n\n", "\n", ". "])

text_chunks = []
for path, chapter in documents.items():
    chunks = text_splitter.split_text(chapter)

    for i, chunk in enumerate(chunks):
        document_text_chunk = text_chunk_template.format(document_name=path, chapter_text=chunk, document_chunk_id=i)
        text_chunks.append(document_text_chunk)

print(f"Split {len(documents)} chapters into {len(text_chunks)} chunks")

Split 26 chapters into 118 chunks
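
To check the chunks against the recommended 1000 to 10000 character range, a quick length summary helps. Two things affect the numbers: the template adds the metadata wrapper on top of the split text, and the last chunk of a chapter can be much shorter than the rest. The following is a minimal sketch. The helper name `summarize_chunk_lengths` is hypothetical, and synthetic strings stand in for `text_chunks` so the block runs on its own.

```python
from typing import Dict, List


def summarize_chunk_lengths(
    chunks: List[str], min_chars: int = 1000, max_chars: int = 10000
) -> Dict[str, int]:
    """Count chunks below, within, and above the given character range."""
    summary = {"too_short": 0, "in_range": 0, "too_long": 0}
    for chunk in chunks:
        length = len(chunk)
        if length < min_chars:
            summary["too_short"] += 1
        elif length > max_chars:
            summary["too_long"] += 1
        else:
            summary["in_range"] += 1
    return summary


# Synthetic chunks of 500, 4000, and 12000 characters.
sample_chunks = ["a" * 500, "b" * 4000, "c" * 12000]
print(summarize_chunk_lengths(sample_chunks))
```

In the notebook, calling `summarize_chunk_lengths(text_chunks)` would report how many wrapped chunks fall outside the range.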


## Using Highlights to Retrieve Relevant Chunks

Let's pass the query and all of the chunks to Highlights and keep the five most relevant results.

In [7]:
query = "What are the different ways in which mutations can be classified?"

# Search for relevant chunks
highlights_response = highlights_client.search(
    query=query,
    chunk_txts=text_chunks,
    top_n=5
)

relevant_passages = [result['chunk_txt'] for result in highlights_response['results']]


## Using OpenAI with the Highlights Response

Highlights returns the original chunks that we passed in, so we can forward them directly to OpenAI as context.

In [8]:
# Generate response using OpenAI
context = '\n\n'.join(relevant_passages)
combined_prompt = f"""
Context information is below.
----------------
{context}
----------------
Using the above context, please answer the following question: {query}
"""

response = openai.chat.completions.create(
    model="gpt-4o-mini",  # or another appropriate model
    messages=[
        {"role": "system", "content": "You are a helpful assistant that answers questions based on provided context."},
        {"role": "user", "content": combined_prompt}
    ],
    temperature=0.7
)


print("\nGenerated Response:")
print(response.choices[0].message.content)


Generated Response:
Mutations can be classified in several ways based on different criteria:

1. **Causal Agent**:
   - Spontaneous Mutations: Occur naturally without intentional exposure to a mutagen.
   - Induced Mutations: Caused by mutagens, such as chemicals or radiation.

2. **Rate or Frequency of Occurrence**:
   - Rare Mutations: Occur infrequently in populations and are usually recessive.
   - Recurrent Mutations: Occur repeatedly and can potentially change gene frequency in populations.

3. **Kind of Tissue Involved and Type of Inheritance**:
   - Somatic Mutations: Occur in somatic tissue and are not passed on to progeny.
   - Germinal (or Germ-Line) Mutations: Occur in reproductive cells and can be inherited by future generations.

4. **Impact on Fitness or Function**:
   - Deleterious Mutations: Harmful and decrease the fitness of the individual.
   - Advantageous Mutations: Beneficial and increase the fitness of the individual.
   - Neutral Mutations: Neither beneficial 