# PDF Processing with Highlights and OpenAI Integration

This notebook demonstrates how to:
1. Convert a PDF document to text
2. Split the text into page-level chunks
3. Use the Highlights API to search for relevant chunks
4. Send the most relevant chunks to OpenAI for text generation

## Setup

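First, let's install and import the required libraries. Note that `HighlightsClient` is imported from `base_client`, which the `pip` command below does not install, so that module needs to be available on the Python path.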

In [None]:
# The chat call used later relies on the openai>=1.0 interface (openai.chat.completions)
!pip install PyPDF2 "openai>=1.0" python-dotenv

import os
import PyPDF2
import openai
from dotenv import load_dotenv
from typing import List, Dict
from base_client import HighlightsClient

## Loading Environment Variables

Create a `.env` file containing your API keys:
```
HIGHLIGHTS_API_KEY=your-highlights-api-key
OPENAI_API_KEY=your-openai-api-key
```
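
If either key is missing from `.env`, `os.getenv` returns `None` and the failure tends to surface only later, when a client makes its first request. A minimal sketch of an early check follows; `missing_keys` and `REQUIRED_KEYS` are hypothetical names introduced here, not part of any library.

```python
import os

# Variable names taken from the .env example above
REQUIRED_KEYS = ("HIGHLIGHTS_API_KEY", "OPENAI_API_KEY")

def missing_keys(env=os.environ, required=REQUIRED_KEYS):
    """Return the names of required variables that are unset or empty."""
    return [name for name in required if not env.get(name)]
```

Calling `missing_keys()` right after `load_dotenv()` and raising an error if the list is non-empty would make a misconfigured environment obvious before any API call is attempted.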

In [23]:
# Load environment variables
load_dotenv()

# Initialize clients
highlights_client = HighlightsClient(api_key=os.getenv('HIGHLIGHTS_API_KEY'))
openai.api_key = os.getenv('OPENAI_API_KEY')

## PDF Processing Class

Let's create a class to handle the PDF processing workflow. This simple implementation extracts text from a PDF and splits it into one chunk per page. The chunks are then passed to the Highlights API, which returns the ones most relevant to a query, and those chunks serve as context for the OpenAI request.

In [5]:
class PDFProcessor:
    def __init__(self, highlights_client, temperature: float = 0.7):
        self.highlights_client = highlights_client
        self.temperature = temperature

    def extract_text_from_pdf(self, pdf_path: str) -> List[str]:
        """
        Extract text from PDF, split by pages.

        Args:
            pdf_path: Path to the PDF file

        Returns:
            List of strings, where each string is the text from one page
        """
        text_chunks = []

        with open(pdf_path, 'rb') as file:
            # Create PDF reader object
            pdf_reader = PyPDF2.PdfReader(file)

            # Extract text from each page
            for page in pdf_reader.pages:
                # Fall back to an empty string so .strip() cannot fail on a missing result
                text = page.extract_text() or ""
                if text.strip():  # Only add non-empty pages
                    text_chunks.append(text)

        return text_chunks

    def search_relevant_chunks(self, query: str, text_chunks: List[str], top_n: int = 3) -> List[str]:
        """
        Search for relevant chunks using Highlights API.

        Args:
            query: Search query
            text_chunks: List of text chunks to search through
            top_n: Number of top results to return

        Returns:
            List of most relevant text chunks
        """
        results = self.highlights_client.search_text_chunks(
            query=query,
            text_chunks=text_chunks,
            top_n=top_n
        )

        return [result['chunk_txt'] for result in results['results']]

    def generate_response(self, query: str, context: List[str]) -> str:
        """
        Generate response using OpenAI API with context.

        Args:
            query: The question to answer
            context: List of relevant text chunks to use as context

        Returns:
            Generated response
        """
        # Combine context and prompt
        combined_prompt = f"""
        Context information is below.
        ----------------
        {' '.join(context)}
        ----------------
        Using the above context, please answer the following question: {query}
        """

        response = openai.chat.completions.create(
            model="gpt-4o-mini",  # or another appropriate model
            messages=[
                {"role": "system", "content": "You are a helpful assistant that answers questions based on provided context."},
                {"role": "user", "content": combined_prompt}
            ],
            temperature=self.temperature
        )

        return response.choices[0].message.content
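
As written, `search_relevant_chunks` only needs a client whose `search_text_chunks(query=..., text_chunks=..., top_n=...)` returns a dict with a `results` list of items carrying a `chunk_txt` key. To try the workflow without network access, a stand-in with the same shape could be passed to `PDFProcessor` instead. The sketch below is hypothetical: `KeywordOverlapClient` ranks chunks by plain word overlap, which is an assumption made for illustration and not how the Highlights API scores results.

```python
import re
from typing import Dict, List

class KeywordOverlapClient:
    """Hypothetical offline stand-in for HighlightsClient (not the real API)."""

    @staticmethod
    def _words(text: str) -> set:
        # Lowercased alphanumeric tokens
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def search_text_chunks(self, query: str, text_chunks: List[str], top_n: int = 3) -> Dict:
        query_words = self._words(query)
        scored = [
            (len(query_words & self._words(chunk)), index, chunk)
            for index, chunk in enumerate(text_chunks)
        ]
        # Highest overlap first; ties keep the original page order
        scored.sort(key=lambda item: (-item[0], item[1]))
        # Mirror the response shape that search_relevant_chunks reads
        return {"results": [{"chunk_txt": chunk} for _, _, chunk in scored[:top_n]]}
```

With this in place, `PDFProcessor(KeywordOverlapClient())` would exercise extraction and retrieval locally, leaving only `generate_response` dependent on an external service.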

## Using the PDF Processor

Now let's try processing a PDF document.

In [None]:
# Initialize processor
processor = PDFProcessor(highlights_client)

# Path to your PDF file
pdf_path = 'data/border_act.pdf'

# Extract text chunks from PDF
text_chunks = processor.extract_text_from_pdf(pdf_path)

print(f"Extracted {len(text_chunks)} non-empty pages from PDF")
if text_chunks:  # Avoid an IndexError when no text could be extracted
    print("\nSample from first page:")
    print(text_chunks[0][:200] + "...")

## Searching and Generating

Let's search for relevant chunks and generate a response based on them.

In [None]:
query = "Am I an eligible individual for CONDITIONAL PERMANENT RESIDENT STATUS? I was paroled into the US in 2020."

# Search for relevant chunks
relevant_chunks = processor.search_relevant_chunks(
    query=query,
    text_chunks=text_chunks,
    top_n=5
)

print(f"Found {len(relevant_chunks)} relevant chunks")

# Generate response using OpenAI
response = processor.generate_response(
    query=query,
    context=relevant_chunks
)

print("\nGenerated Response:")
print(response)
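
`generate_response` joins every retrieved chunk into a single prompt, so with `top_n=5` the prompt can contain five full pages. If prompt size becomes a concern, one option is to cap the context before generation. The helper below is a minimal sketch: `limit_context` and its `max_chars` default are hypothetical, and a character budget is only a rough proxy for a model's token limit.

```python
from typing import List

def limit_context(chunks: List[str], max_chars: int = 12000) -> List[str]:
    """Keep chunks in ranked order until the character budget is used up."""
    kept: List[str] = []
    used = 0
    for chunk in chunks:
        remaining = max_chars - used
        if remaining <= 0:
            break
        piece = chunk[:remaining]  # Truncate the chunk that would overflow
        kept.append(piece)
        used += len(piece)
    return kept
```

It could be applied as `processor.generate_response(query=query, context=limit_context(relevant_chunks))`, which keeps the highest-ranked chunks intact and trims only the last one that fits partially.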