# Retrieval Augmented Generation

To include context for Claude, long prompts cost more money and time to process, and Claude is slightly less effective. 

##### With RAG:
1. Break up the document in chunks
2. Feed only relevant chunks to the prompt

<br>

| Upsides | Downsides |
| -- | -- |
| - Focus on relevant content only <br> - Scale to large & multiple docs. <br> - Smaller prompt (cheaper and faster) | - Pre-procesing <br> - Needs searching mechanism to find relevant chunk <br> - May leave out some relevant content <br> - Many ways to chunk the text | 

#### Imports and constants

In [10]:
import voyageai
from anthropic import Anthropic
from dotenv import load_dotenv

load_dotenv()

# add VOYAGE_API_KEY to .env
embedding_client = voyageai.Client()

# Claude
client = Anthropic()
model = "claude-3-7-sonnet-latest"

  from .autonotebook import tqdm as notebook_tqdm


In [11]:
import json
import re
from math import sqrt
from lib.SearchStrategiesClass import VectorIndex, BM25Index, Retriever
from lib.ClaudeResponsesClass import ClaudeResponses

## 1. Chunking Strategies

| Size Based | Structure Based | Semantic Based |
| -- | -- | -- |
| Divide text into strings of equal length (overlap sections to include some context) | Divide text based upon structure (headers, paragraphs, sections, ...) | Divide into groups of related sentences or sections |
| Easy to implement | Keeps related content in the same chunk | Needs to understand meaning of individual sentences |
| Can be repetitive, migh separate related content | Needs us to understand the structure of the document beforehand | Computationally expensive, but more relevant chunks |

In [14]:
with open("report.md", "r") as f:
    text = f.read()

In [2]:
# Chunk by a set number of characters
def chunk_by_char(text: str,
                  chunk_size: int = 150,
                  chunk_overlap: int = 20
):
    chunks = []
    start_idx = 0
    while start_idx < len(text):
        end_idx = min(start_idx + chunk_size, len(text))
        chunk = text[start_idx:end_idx]
        chunks.append(chunk)
        start_idx = (
            end_idx - chunk_overlap if end_idx < len(text) else len(text)
        )
    return chunks

In [3]:
chunks = chunk_by_char(text)

for chunk in chunks[0:5]:
    print(chunk + "\n\n -------------------------------- \n ")

# **Annual Interdisciplinary Research Review: Cross-Domain Insights**

## Executive Summary

This report synthesizes the key findings and ongoing rese

 -------------------------------- 
 
ngs and ongoing research efforts across the organization's diverse operational and R&D departments for the past fiscal year. Our strength lies in the 

 -------------------------------- 
 
trength lies in the cross-pollination of ideas and methodologies, driving innovation and addressing complex challenges that transcend traditional disc

 -------------------------------- 
 
end traditional disciplinary boundaries. This year's review highlights significant progress in ten critical areas. Advances in **Medical Research** fo

 -------------------------------- 
 
edical Research** focused on the rare XDR-471 syndrome, yielding new diagnostic insights. Concurrently, **Software Engineering** tackled persistent st

 -------------------------------- 
 


In [7]:
# Chunk by sentence
def chunk_by_sentence(text: str,
                      max_sentences: int = 5,
                      overlap_sentences: int = 1
):
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks = []
    start_idx = 0
    while start_idx < len(sentences):
        end_idx = min(start_idx + max_sentences, len(sentences))
        chunk = sentences[start_idx:end_idx]
        chunks.append(" ".join(chunk))

        start_idx += max_sentences - overlap_sentences

        if start_idx < 0:
            start_idx = 0

    return chunks

In [8]:
chunks = chunk_by_sentence(text)

for chunk in chunks[0:5]:
    print(chunk + "\n\n -------------------------------- \n ")

# **Annual Interdisciplinary Research Review: Cross-Domain Insights**

## Executive Summary

This report synthesizes the key findings and ongoing research efforts across the organization's diverse operational and R&D departments for the past fiscal year. Our strength lies in the cross-pollination of ideas and methodologies, driving innovation and addressing complex challenges that transcend traditional disciplinary boundaries. This year's review highlights significant progress in ten critical areas. Advances in **Medical Research** focused on the rare XDR-471 syndrome, yielding new diagnostic insights. Concurrently, **Software Engineering** tackled persistent stability issues, implementing key fixes identified through error code analysis (e.g., `ERR_MEM_ALLOC_FAIL_0x8007000E`).

 -------------------------------- 
 
Concurrently, **Software Engineering** tackled persistent stability issues, implementing key fixes identified through error code analysis (e.g., `ERR_MEM_ALLOC_FAIL_0x800700

In [12]:
# Chunk by section (example for markdown)
def chunk_by_section(document_text, pattern = r"\n##"):
    return re.split(pattern, document_text)

In [15]:
chunks = chunk_by_section(text)

for chunk in chunks[0:5]:
    print(chunk + "\n\n -------------------------------- \n ")

# **Annual Interdisciplinary Research Review: Cross-Domain Insights**


 -------------------------------- 
 
 Executive Summary

This report synthesizes the key findings and ongoing research efforts across the organization's diverse operational and R&D departments for the past fiscal year. Our strength lies in the cross-pollination of ideas and methodologies, driving innovation and addressing complex challenges that transcend traditional disciplinary boundaries. This year's review highlights significant progress in ten critical areas. Advances in **Medical Research** focused on the rare XDR-471 syndrome, yielding new diagnostic insights. Concurrently, **Software Engineering** tackled persistent stability issues, implementing key fixes identified through error code analysis (e.g., `ERR_MEM_ALLOC_FAIL_0x8007000E`). **Financial Analysis** revealed mixed quarterly performance, prompting strategic reviews, particularly concerning resource allocation impacting R&D pipelines.

Crucial develop

## 2. Text Embeddings

Search problem: look through chunks and identify the most related to the user's question.

1. **Semantic Search**: uses ***text embeddings*** to understand the meaning and context of question and chunks.
    - We must choose an embedding model. Anthropic recommends **VoyageAI** for Claude models.
2. **Keyword-based** *(not in this notebook)*

In [17]:
# Simple embedding function
def generate_embedding(chunks: str | list,
                       model: str = "voyage-3-large",
                       input_type: str = "query"
):
    is_list = isinstance(chunks, list)
    input = chunks if is_list else [chunks]
    result = embedding_client.embed(input, model, input_type)
    return result.embeddings if is_list else result.embeddings[0]

In [None]:
chunks = chunk_by_char(text)
results = generate_embedding(chunks[0])

In [12]:
print("Length of embedding:", len(results))
results[:15]

Length of embedding: 1024


[-0.07848451286554337,
 0.030050374567508698,
 -0.007380018476396799,
 0.006761334370821714,
 -0.003800488542765379,
 0.03606044873595238,
 -0.03765135258436203,
 0.02633826993405819,
 0.00614265026524663,
 -0.010561822913587093,
 0.006275225430727005,
 0.0311109758913517,
 -0.0012318444205448031,
 -0.01228530053049326,
 -0.0233332309871912]

## 3. Search Strategies

### 3.1 Embeddings, VectorDB and Cosine Similarity

#### 3.1.1 Cosine Similarity

User's query and chunk's embeddings are compared through Cosine Similarity to choose the most related chunk of text (cosine of the angle between 2 vectors – $\theta$):

**Cosine Similarity**
$$
Sim(A,B) = cos(\theta) = \frac{A \cdot B}{||A||\cdot||B||}
$$
* **1**: Exactly same direction
* **0**: Perpendicular (independent)
* **-1**: Completely opposite direction

**Cosine Distance**
$$1 - Similarity$$
* **0**: Exactly same direction
* **1**: Perpendicular (independent)
* **2**: Completely opposite direction


In [None]:
# Cosine Similarity
def cosine_similarity(a: list[float], b: list[float]):
    if len(a) != len(b):
        raise ValueError("Both entries must have the same length.")
    else:
        dot_product = 0
        magn_a = 0
        magn_b = 0
        for idx in range(len(a)):
            dot_product += a[idx]*b[idx]
            magn_a += a[idx]**2
            magn_b += b[idx]**2
        if magn_a == 0 or magn_b == 0:
            return 0
        else:
            cos_sim = dot_product/(sqrt(magn_a)*sqrt(magn_b))
            return cos_sim
        
# Example usage
a = [0.112, 0.993]
b = [0.295, 0.955]

cosine_similarity(a,b)

0.9825129171527335

In [28]:
# with sklearn
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np 
float(cosine_similarity(np.array([a]), np.array([b]))[0][0])

0.9825129171527336

#### 3.1.2 Semantic Search

Match user's query and content by Cosine Distance.

Five steps to follow:

1. Chunk the text by section
2. Generate embeddings for each chunk
3. Create a vector store and add each embedding to it
4. Generate an embedding for the user's question
5. Search the store to find the most relevant chunks

In [15]:
# 1,2. Chunk and generate embeddings
chunks = chunk_by_section(text)
embeddings = generate_embedding(chunks)

In [16]:
len(chunks) == len(embeddings)

True

In [17]:
# 3. Generate Vector store
vec_store = VectorIndex()
for embedding, chunk in zip(embeddings, chunks):
    vec_store.add_vector(embedding, {"content": chunk})

In [18]:
# 4. Generate embedding for user's question
user_query = "What did the software engineering dept do last year?"
user_embedding = generate_embedding(user_query)

In [19]:
# 5. Search vector store for match
results = vec_store.search(user_embedding, 3) # get 3 top matches

In [20]:
# See best matches
for result in results:
    print("\n------------------------------------------")
    print("Cosine Distance:", result[1])
    print("\nContent:\n", result[0]['content'][:200])


------------------------------------------
Cosine Distance: 0.4820578028613297

Content:
  Section 2: Software Engineering - Project Phoenix Stability Enhancements

The Software Engineering division dedicated considerable effort to improving the stability and performance of the core system

------------------------------------------
Cosine Distance: 0.48413100279652643

Content:
  Executive Summary

This report synthesizes the key findings and ongoing research efforts across the organization's diverse operational and R&D departments for the past fiscal year. Our strength lies 

------------------------------------------
Cosine Distance: 0.48764751365959746

Content:
  Future Directions

This year's cross-domain insights underscore the interconnectedness of our diverse research and operational activities. The stability enhancements achieved in Software Engineering 


### 3.2 Lexical Search

When building RAG pipelines, you'll quickly discover that semantic search alone doesn't always return the best results. Sometimes you need exact term matches that semantic search might miss. The solution is to combine semantic search with lexical search.

* **Semantic search** finds conceptually related content using embeddings.
* **Lexical search** finds exact term matches using classic text search.
* **Merged results** combine both approaches for better accuracy.

#### BM25 (Best Match v. 25)

1. Tokenize the query
    Break the user's question into individual terms. For example, "a INC-2023-Q4-011" becomes ["a", "INC-2023-Q4-011"].

2. Count term frequency
    See how often each term appears across all your documents. Common words like "a" might appear 5 times, while specific terms like "INC-2023-Q4-011" might appear only once.

3. Weight terms by importance
    Terms that appear less frequently get higher importance scores. The word "a" gets low importance because it's common, while "INC-2023-Q4-011" gets high importance because it's rare.

4. Find best matches
    Return documents that contain more instances of the higher-weighted terms.

In [21]:
# 2. Create a BM25 store and add documents
bm25_store = BM25Index()
for chunk in chunks:
    bm25_store.add_document({"content": chunk})

In [49]:
# Semantic Search Results
user_query = "What happened with INC-2023-Q"
user_embedding = generate_embedding(user_query)
vec_results = vec_store.search(user_embedding, 3) # get 3 top matches

In [50]:
for content, score in vec_results:
    print("\n------------------------------------------")
    print("Cosine Distance:", score)
    print("\nContent:\n", content['content'][:150])


------------------------------------------
Cosine Distance: 0.52053258992134

Content:
  Section 2: Software Engineering - Project Phoenix Stability Enhancements

The Software Engineering division dedicated considerable effort to improvin

------------------------------------------
Cosine Distance: 0.595024764747381

Content:
  Table of Contents

1.  Executive Summary
2.  Table of Contents
3.  Methodology
4.  Section 1: Medical Research - Understanding XDR-471 Syndrome
5.  S

------------------------------------------
Cosine Distance: 0.6011882575100602

Content:
  Section 10: Cybersecurity Analysis - Incident Response Report: INC-2023-Q4-011

The Cybersecurity Operations Center successfully contained and remedi


In [51]:
# 3. BM25 Store
bm25_results = bm25_store.search(user_query, 3)

# Printbm25_ results
for doc, distance in bm25_results:
    print("\n------------------------------------------")
    print("Distance:", distance)
    print("\nContent:\n", doc['content'][:150])
    


------------------------------------------
Distance: 0.522946695317638

Content:
  Section 2: Software Engineering - Project Phoenix Stability Enhancements

The Software Engineering division dedicated considerable effort to improvin

------------------------------------------
Distance: 0.5591483528614661

Content:
  Section 10: Cybersecurity Analysis - Incident Response Report: INC-2023-Q4-011

The Cybersecurity Operations Center successfully contained and remedi

------------------------------------------
Distance: 0.9391180106840461

Content:
  Methodology

The insights compiled within this Annual Interdisciplinary Research Review represent a synthesis of findings drawn from standard departm


### 3.3 Hybrid Search - Multi-Index Architecture

Create a <code>Retreiver</code> class that merges both indexes. 

#### Reciprocal Rank Fusion
1. Rank by each index output
2. Compute RRF score

$$
RRF\_score(d) = \sum_{i = 1}^{n} \frac{1}{k + rank_i (d)}
$$

where $k$ is a constant, usually 60.

3. Rank the RRF score


**For example:**
| Section | VectorIndex ranking | BM25 ranking | RRF Score (k = 1) | RRF ranking |
|--|--|--|--|--|
| 2. Software Engineering | 1 | 1 | $\frac{1}{1+1} + \frac{1}{1+1} = 1$ | 1 |
| Table of Contents | 2 | - | $\frac{1}{1+2} + \frac{1}{1+4} = 0.533$ | 3 |
| 10. Cybersecurity Analysis | 3 | 2 | $\frac{1}{1+3} + \frac{1}{1+2} = 0.583$ | 2 |
| Methodology | - | 3 | $\frac{1}{1+4} + \frac{1}{1+3} = 0.45$ | 4 |

In [18]:
vector_index = VectorIndex(embedding_fn=generate_embedding)
bm25_index = BM25Index()
retriever = Retriever(vector_index, bm25_index)

In [19]:
# Add all chunks to the retriever, which internally passes them along to both indexes
# Note: converted to a bulk operation to avoid rate limiting errors from VoyageAI (add_documents instead of add_document)
retriever.add_documents([{"content": chunk} for chunk in chunks])

In [21]:
user_query = "What happened with INC-2023-Q"
results = retriever.search(user_query, 3)

In [23]:
# Print overall results
for doc, score in results:
    print("\n------------------------------------------")
    print("Score:", score)
    print("\nContent:\n", doc['content'][:200])


------------------------------------------
Score: 0.03278688524590164

Content:
  Section 2: Software Engineering - Project Phoenix Stability Enhancements

The Software Engineering division dedicated considerable effort to improving the stability and performance of the core system

------------------------------------------
Score: 0.03200204813108039

Content:
  Section 10: Cybersecurity Analysis - Incident Response Report: INC-2023-Q4-011

The Cybersecurity Operations Center successfully contained and remediated a targeted intrusion attempt tracked as `INC-

------------------------------------------
Score: 0.031024531024531024

Content:
  Methodology

The insights compiled within this Annual Interdisciplinary Research Review represent a synthesis of findings drawn from standard departmental reporting cycles, specialized project update


## 4. LLM-Based Re-ranking

Ask Claude to look at the user's question. We provide the chunks that seem to be relevant, and Claude will return the most relevant, re ordered.

In [48]:
claude = ClaudeResponses()

In [63]:
def reranker_fn(docs, query_text, k):
    joined_docs = "\n".join(
        [
            f"""
        <document>
        <document_id>{doc["id"]}</document_id>
        <document_content>{doc["content"]}</document_content>
        </document>
        """
            for doc in docs
        ]
    )

    prompt = f"""
    You are about to be given a set of documents, along with an id of each.
    Your task is to select and sort the {k} most relevant documents to answer the user's question.

    Here is the user's question:
    <question>
    {query_text}
    </question>
    
    Here are the documents to select from:
    <documents>
    {joined_docs}
    </documents>

    Respond in the following format:
    ```json
    {{
        "document_ids": str[] # List document ids, {k} elements long, sorted in order of decreasing relevance to the user's query. The most relevant documents should be listed first.
    }}
    ```

    Do not include any reasoning. Return ONLY the json.
    """
    claude.add_user_message(prompt)
    claude.add_assistant_message("```json")
    result = claude.chat(return_text= True, stop_sequences = ["´´´"])

    return json.loads(str(result))['document_ids']


In [64]:
# Create a vector index, a bm25 index, then use them to create a Retriever
vector_index = VectorIndex(embedding_fn=generate_embedding)
bm25_index = BM25Index()

retriever = Retriever(bm25_index, vector_index, reranker_fn=reranker_fn)

In [65]:
# Add all chunks to the retriever, which internally passes them along to both indexes
# Note: converted to a bulk operation to avoid rate limiting errors from VoyageAI
retriever.add_documents([{"content": chunk} for chunk in chunks])

In [67]:
results = retriever.search("what did the eng team do with INC-2023-Q4-011?", 2)

for doc, score in results:
    print(score, "\n", doc["content"][0:200], "\n---\n")


{
    "document_ids": ["Z7wr", "faNd"]
}
```


JSONDecodeError: Extra data: line 5 column 1 (char 42)