# Generating Post-Hoc Citations

This notebook demonstrates how to:
1. Take some generated text and find any "Statements" that would help with citations
2. Send the Statements to Highlight's API to get the relevant chunks
3. Show the References in a clean way

## Setup

First, let's install and import the required libraries.

In [8]:
!pip install PyPDF2 openai python-dotenv

import os
import PyPDF2
import openai
import re
from dotenv import load_dotenv
from typing import List, Dict
from base_client import HighlightsClient

Defaulting to user installation because normal site-packages is not writeable

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


## Loading Environment Variables

Create a .env file with your API keys:
```
HIGHLIGHTS_API_KEY=your-highlights-api-key
OPENAI_API_KEY=your-openai-api-key
```

In [28]:
# Load environment variables
load_dotenv()

# Initialize clients
highlights_client = HighlightsClient(api_key=os.getenv('HIGHLIGHTS_API_KEY'))
openai.api_key = os.getenv('OPENAI_API_KEY')

## PDF Processing & Statment extraction
In this example, we'll go ahead and resue the PDF extractor from the previous cookbooks but with a twist, this class will also go ahead and extract out any "Statments" from the LLM answer which require references.

In [23]:
class PDF_Statement_Processor:
    def __init__(self, highlights_client, temperature: float = 0.7):
        self.highlights_client = highlights_client
        self.temperature = temperature

    def extract_text_from_pdf(self, pdf_path: str) -> List[str]:
        """
        Extract text from PDF, split by pages.

        Args:
            pdf_path: Path to the PDF file

        Returns:
            List of strings, where each string is the text from one page
        """
        text_chunks = []

        with open(pdf_path, 'rb') as file:
            # Create PDF reader object
            pdf_reader = PyPDF2.PdfReader(file)

            # Extract text from each page
            for page in pdf_reader.pages:
                text = page.extract_text()
                if text.strip():  # Only add non-empty pages
                    text_chunks.append(text)

        return text_chunks

    def search_relevant_chunks(self, query: str, text_chunks: List[str], top_n: int = 1) -> List[str]:
        """
        Search for relevant chunks using Highlights API.

        Args:
            query: Search query
            text_chunks: List of text chunks to search through
            top_n: Number of top results to return

        Returns:
            List of most relevant text chunks
        """
        results = self.highlights_client.search_text_chunks(
            query=query,
            text_chunks=text_chunks,
            top_n=top_n
        )

        return [result['chunk_txt'] for result in results['results']]

    def find_statements_to_cite(self, query: str) -> str:
        """
        Generate response using OpenAI API with context.

        Args:
            query: The question to answer

        Returns:
            Generated response
        """

        response = openai.chat.completions.create(
            model="gpt-4o-mini",  # or another appropriate model
            messages=[
                {
                    "role": "system",
                    "content": """
                    You are a citation analyzer that identifies statements in text that require supporting evidence. Your task is to:

                    1. Parse the input text and identify distinct factual claims or assertions.
                    2. Extract each statement that would benefit from a citation (facts, statistics, research findings, historical events, etc.).
                    3. Number each statement sequentially.
                    4. Format the output as a numbered list, with one statement per line.
                    5. Only include statements that make factual claims - exclude opinions, personal reflections, and purely subjective content.
                    6. For multi-sentence statements on the same topic or fact, keep them grouped as a single numbered item.
                    7. Do not add any commentary, explanations, or additional text beyond the numbered statements.

                    Example output format:
                    1. [Factual statement that needs citation]
                    2. [Another factual statement that needs citation]
                    3. [A third factual statement that needs citation]"""
                },
                {"role": "user", "content": query}
            ],
            temperature=self.temperature
        )

        return response.choices[0].message.content

    def parse_numbered_text(self, text):
        """
        Parse text with numbered statements into individual statements.

        This function looks for patterns like "1.", "2.", etc. and extracts
        the statements that follow them.

        Args:
            text (str): Text containing numbered statements

        Returns:
            list: List of statements with their numbers
        """
        # Find all statement numbers in the text
        number_positions = [(m.start(), m.group()) for m in re.finditer(r'(?:^|\s+)(\d+)\.', text)]

        statements = []

        # Process each statement
        for i, (pos, num) in enumerate(number_positions):
            # Get the number without the period
            number = num.strip().rstrip('.')

            # Find where the statement starts (after the number and any whitespace)
            start = pos + len(num)
            while start < len(text) and text[start].isspace():
                start += 1

            # Find where the statement ends (at the next number or the end of text)
            if i < len(number_positions) - 1:
                end = number_positions[i+1][0]
            else:
                end = len(text)

            # Extract the statement content
            content = text[start:end].strip()
            if content == "":
                continue
            statements.append({
                "number": number,
                "content": content
            })

        return statements

## Using the PDF & Statement Processor

Now let's try processing a PDF document. And then we'll extract the statements to get the sources.

In [24]:
# Initialize the PDF Processor
processor = PDF_Statement_Processor(highlights_client)

# Process the PDF
pdf_text = processor.extract_text_from_pdf("data/amd_10k.pdf")

# Find Statements to Cite
generated_text = """
According to AMD's 2022 Annual Report, the company's net revenue increased 44% to $23.6 billion in 2022 compared to $16.4 billion in 2021.
This growth was driven by several factors:
1) Data Center segment revenue increased by 64% primarily due to higher sales of EPYC server processors
2) Gaming segment revenue increased by 21% primarily due to higher semi-custom product sales (like gaming console SoCs)
3) Embedded segment revenue saw a significant increase driven by the inclusion of Xilinx embedded product revenue following the acquisition of Xilinx in February 2022"""
statements = processor.find_statements_to_cite(generated_text)



## Generating Sources for the statements using Highlights
Now we'll feed the extracted statments in with the origianl content to formulate accurate sources!

In [27]:
for index, statement in enumerate(processor.parse_numbered_text(statements)):
    print("Statement Number: ", index)
    print("Statement: ", statement["content"])
    relevant_chunks = processor.search_relevant_chunks(statement["content"], pdf_text)
    print("Relevant Sources: ", relevant_chunks)
    print("\n")


Statement Number:  0
Statement:  AMD's net revenue increased 44% to $23.6 billion in 2022 compared to $16.4 billion in
Relevant Sources:  ['ITEM 6. [RESERVED]\nITEM 7. MANAGEMENT’S DISCUSSION AND ANALYSIS OF FINANCIAL CONDITION AND\nRESULTS OF OPERATIONS\nThe following discussion should be read in conjunction with the consolidated financial statements as of\nDecember 31, 2022 and December 25, 2021 and for each of the three years in the period ended December 31,2022 and related notes, which are included in this Annual Report on Form 10-K as well as with the other sectionsof this Annual Report on Form 10-K, “Part II, Item 8: Financial Statements and Supplementary Data.”\nIntroduction\nIn this section, we will describe the general financial condition and the results of operations of Advanced Micro\nDevices, Inc. and its wholly-owned subsidiaries (collectively, “us,” “our” or “AMD”), including a discussion ofour results of operations for 2022 compared to 2021, an analysis of changes in our