# BambooHR Take-home RAG Project

Joe Davison

August 13, 2024

### Load & index documents

I'll start by loading all the PDFs in the `./documents` directory and preparing them for retrieval.

In [1]:
from pathlib import Path
from PyPDF2 import PdfReader
import pandas as pd

documents = []
for file_path in Path('./documents').iterdir():
    if file_path.suffix.lower() == '.pdf':
        with file_path.open('rb') as file:
            pdf_reader = PdfReader(file)
            text = ''
            for page in pdf_reader.pages:
                text += page.extract_text()
            title = file_path.stem
            documents.append({'text': text, 'title': title})
    else:
        # Future: Handle other file types
        pass

documents_df = pd.DataFrame(documents)
documents_df

| | text | title |
|---|---|---|
| 0 | Australia women's\nnational softball team\nInf... | Australia Women's Softball Team |
| 1 | R.31\nRole Reconnaissance\nManufacturer Renard... | Renard R.31 |


In [2]:
print(documents_df.text[0][:200])

Australia women's
national softball team
Information
Country
  Australia
Federation Softball Australia
ConfederationWBSC Oceania
WBSC World
Rank10 
 3 (10
November
2023)[1]
Women's Softball World Cup



As is common when working with PDFs, the whitespace handling of the extracted text is imperfect. High-performance LLMs generally handle this kind of noise without trouble, so I'll leave it as-is for this exercise. In a production system, it might be advisable to add a preprocessing step to clean up this text, either using a rule-based approach or even using an LLM to reformat the text.
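
As a rough idea of what the rule-based option could look like, here is a minimal sketch that collapses runs of spaces and blank lines. The helper name `normalize_whitespace` and the sample string are hypothetical and not part of the pipeline below. It does not repair words that the extractor split in two (such as `reconna issance`), which would need a dictionary or an LLM pass.

```python
import re

def normalize_whitespace(text):
    """Rule-based cleanup for PDF-extracted text (illustrative sketch only)."""
    # Collapse runs of spaces/tabs inside each line and trim both ends of the line.
    lines = [re.sub(r'[ \t]+', ' ', line).strip() for line in text.split('\n')]
    cleaned = '\n'.join(lines)
    # Collapse three or more consecutive newlines into a single blank line.
    return re.sub(r'\n{3,}', '\n\n', cleaned).strip()

# Hypothetical sample that mimics the noise seen in the extracted text above.
sample = "Country\n  Australia\nRank10 \n 3 (10\nNovember\n2023)[1]\n\n\n\nWomen's Softball World Cup"
print(normalize_whitespace(sample))
```

If adopted, this would be applied to `text` before it is appended to `documents`.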

##### Split documents into smaller chunks

With just two PDFs, we might be able to load the entire text of both documents into the model context. This wouldn't scale to larger knowledge bases, however. It also risks inundating the model with too much information, which may hurt performance, especially with smaller models.

Instead, I will split each document into smaller document *chunks* which will be indexed and retrieved. I will also add some overlap between chunks to minimize the chance of losing important contextual information by splitting a document mid-sentence, for example.

In [3]:
CHUNK_SIZE = 1500 # number of characters per chunk (plain character slice, no whitespace splitting)
OVERLAP = 500 # number of characters to overlap between chunks

chunks = []
for doc_id, row in documents_df.iterrows():
    doc = row['text']
    title = row['title']
    for i in range(0, len(doc), CHUNK_SIZE-OVERLAP):
        chunk = doc[i:i+CHUNK_SIZE].strip()
        chunk = f'<DOCUMENT>\n<TITLE>{title}</TITLE>\n<EXCERPT>\n{chunk}\n</EXCERPT>\n</DOCUMENT>'
        chunks.append({'text': chunk, 'doc_id': doc_id})

chunks_df = pd.DataFrame(chunks)

print(chunks_df.text[1])

<DOCUMENT>
<TITLE>Australia Women's Softball Team</TITLE>
<EXCERPT>
orting teams on the world stage, and they have
achieved outstanding results over the last 3 decades. Alongs ide the
USA team, the Aussie Spirit are the only other team to medal at all
4 Olympics that softball was included as a sport in the Olympics
program.[2] At the inaugural Women's Softball World
Championship held in Melbourne, 1965. Australia claimed the first
ever title, winning Gold and stamped themselves as a pioneer in the
sport.
The national team has not secured as much funding as male
dominated sports in Australia despite having performed better than
some and having won major international competitions.[3] The
removal of softball from the Olympic programme resulted in the
national team getting less funding.[4]
Australian women competed in their first international competition
in 1949 when they played a series against New Zealand in St Kilda
at the St Kilda Cricket Ground.[5] 10,000 people watched the game
liv
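
The excerpt above starts and ends mid-word (`orting`, `liv`) because the chunks are cut at fixed character offsets. One possible alternative, sketched here under the assumption that a naive regex sentence splitter is good enough, is to pack whole sentences into each chunk and carry the last sentence forward as overlap. `chunk_by_sentence` and its parameters are hypothetical names, and the splitter will mis-split abbreviations such as `G. Renard`.

```python
import re

def chunk_by_sentence(text, chunk_size=1500, overlap_sentences=1):
    """Greedily pack whole sentences into chunks of roughly chunk_size characters."""
    # Naive splitter: break after '.', '!' or '?' when followed by whitespace.
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    chunks, current = [], []
    for sentence in sentences:
        candidate = ' '.join(current + [sentence])
        if current and len(candidate) > chunk_size:
            chunks.append(' '.join(current))
            # Carry the last few sentences forward so adjacent chunks overlap.
            current = current[-overlap_sentences:] if overlap_sentences else []
        current.append(sentence)
    if current:
        chunks.append(' '.join(current))
    return chunks

# Tiny chunk_size so the behaviour is visible on a toy string.
demo = chunk_by_sentence('One. Two is here. Three! Four?', chunk_size=20)
print(demo)
```

A chunk can still exceed `chunk_size` when a single sentence (or the carried-over overlap plus one sentence) is longer than the limit, so the size is a target rather than a hard cap.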

##### Create chunk embeddings

In order to retrieve chunks, we will need to compute embeddings for each chunk. I will use OpenAI's `text-embedding-3-small` model for this purpose.

In [4]:
import openai

client = openai.OpenAI()
EMBED_MODEL = 'text-embedding-3-small'

# compute embeddings and add to chunks_df
embeddings = []
for chunk in chunks_df.text:
    embedding = client.embeddings.create(input=chunk, model=EMBED_MODEL)
    embeddings.append(embedding.data[0].embedding)

chunks_df['embedding'] = embeddings

chunks_df.head()

| | text | doc_id | embedding |
|---|---|---|---|
| 0 | `<DOCUMENT>\n<TITLE>Australia Women's Softball ...` | 0 | [-0.008297553285956383, -0.0018626594683155417... |
| 1 | `<DOCUMENT>\n<TITLE>Australia Women's Softball ...` | 0 | [0.0019335211254656315, 0.003245603758841753, ... |
| 2 | `<DOCUMENT>\n<TITLE>Australia Women's Softball ...` | 0 | [-0.002718560164794326, 0.007791959680616856, ... |
| 3 | `<DOCUMENT>\n<TITLE>Australia Women's Softball ...` | 0 | [0.009526480920612812, 0.00444354023784399, 0.... |
| 4 | `<DOCUMENT>\n<TITLE>Australia Women's Softball ...` | 0 | [-0.004598692990839481, -0.012488720007240772,... |


##### Create retrieval endpoint

We could use a vector database such as Pinecone or Chroma to store and retrieve the chunks, or use something simpler like Postgres and compute the similarity scores manually. For this exercise, I will store the embeddings in a simple NumPy array and retrieve from that.

In [5]:
import numpy as np

def retrieve_chunks(query, n=5):
    """
    Retrieves the `n` chunks that are closest to `query` in embedding space.
    Selected according to cosine similarity.
    """
    # embed the query
    query_embed = client.embeddings.create(input=query, model=EMBED_MODEL)
    query_embed = query_embed.data[0].embedding
    # compute similarities
    doc_embeds = np.array(chunks_df.embedding.tolist())
    # note: cosine similarity is equivalent to dot product when vectors are unit-length,
    # which is what openai's embeddings API returns
    similarities = doc_embeds @ query_embed
    # sort and select
    chunk_inds = np.argsort(similarities)[::-1][:n]
    retrieved_df = chunks_df.iloc[chunk_inds].copy()
    retrieved_df['similarity'] = similarities[chunk_inds]
    return retrieved_df
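
The comment in `retrieve_chunks` relies on the fact that, for unit-length vectors, the dot product and cosine similarity are the same number. A small check of that identity is sketched below. The vectors are random stand-ins rather than real embeddings, and the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Random stand-in vectors for illustration; these are not real embeddings.
docs = rng.normal(size=(4, 8))
query = rng.normal(size=8)

# Scale every vector to unit length, which is what retrieve_chunks assumes of its inputs.
docs_unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)
query_unit = query / np.linalg.norm(query)

# Dot product of the unit vectors vs. cosine similarity of the raw vectors.
dot = docs_unit @ query_unit
cosine = (docs @ query) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))

print(np.allclose(dot, cosine))
```

If the embeddings ever came from a source that does not normalize its output, applying the same normalization step before the matrix product would keep the ranking a true cosine ranking.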

Let's test it out with one of the given prompts:

In [6]:
retrieved = retrieve_chunks('Which two companies created the R.31 reconnaissance aircraft?')
print(retrieved.iloc[0].text)

<DOCUMENT>
<TITLE>Renard R.31</TITLE>
<EXCERPT>
R.31
Role Reconnaissance
Manufacturer Renard
First flight 1932
Introduction 1935
Retired 1940
Primary user Belgian Air Force
Number built 34Renard R.31
The Renard R.31 was a Belgian reconna issance
aircraft of the 1930s . A single-engined parasol
monopl ane, 32 R.31s were built for the Belgian Air
Force, the survivors of which, although obsolete,
remained in service when Nazi Germany invaded
Belgium in 1940. The Renard R.31 was the only World
War II operational military aircraft entirely designed and
built in Belgium.
The Renard R.31 was designed by Alfred Renard of
Constructions Aéronautiques G. Renard to meet a
requirement of the Belgian Air Force for a short ranged
reconna issance and army co-operation aircraft. It first
flew from Evere Airfield, near Brussels, on 16 October
1932.[1]
It was a parasol monopl ane of mixed construction,
powered by a Rolls-Royce Kestrel engine, with a welded steel tubing structure with metal sheet covering

### Use GPT-4o to generate response

Now that the retrieval piece is in place, I'll construct a prompt template that will be used to generate a response from GPT-4o using the retrieved chunks.

When constructing the prompt, I will use XML-like `<DOCUMENT>` tags to separate the documents, which I have found to be a helpful prompt engineering technique for wrapping inserted text in a structured but readable way.

In [7]:
SYSTEM_PROMPT_BASE = """\
You are helpful assistant that can answer user questions using the provided documents.

Documents:
---
{documents}
---

Answer the user's question using information found in the documents above."""

In [8]:
def construct_system_prompt(documents):
    documents_str = '\n\n'.join(documents)
    return SYSTEM_PROMPT_BASE.format(documents=documents_str)

In [9]:
print(construct_system_prompt(chunks_df.text.iloc[:5]))

You are helpful assistant that can answer user questions using the provided documents.

Documents:
---
<DOCUMENT>
<TITLE>Australia Women's Softball Team</TITLE>
<EXCERPT>
Australia women's
national softball team
Information
Country
  Australia
Federation Softball Australia
ConfederationWBSC Oceania
WBSC World
Rank10 
 3 (10
November
2023)[1]
Women's Softball World Cup
Appearances 17 (First in 1965)
Best result
  1st (1 time, in
1965)
USA Softball International Cup
Appearances 8 (First in 2005)
Best result
  2nd (2 times,
most recent in
2012)
Olympic Games
Appearances 5 (First in 1996)
Best result
  2nd (1 time, in
2004)
Australia women's national
softball team
Medal record
Softball at the Summer Olympics
Representing 
 Australia
Olympic Games
2004 Athens Team
1996 Atlanta Team
2000 Sydney Team
2008 Beijing Team
World Championship
1965 MelbourneAustralia women's national softball team
The Australia women's national softball team, also know n as
the Aussie Spirit,[2] is the national soft

##### Generate response from LLM

Now that we have our retrieval and system prompts in place, we can use the OpenAI client to generate a response from GPT-4o.

In [10]:
def generate_response(query, n_retrieved=5):
    retrieved = retrieve_chunks(query, n=n_retrieved)
    system_prompt = construct_system_prompt(retrieved.text)
    response = client.chat.completions.create(
        model='gpt-4o',
        messages=[{'role': 'system', 'content': system_prompt}, {'role': 'user', 'content': query}]
    )
    return response.choices[0].message.content

In [11]:
question = 'Which two companies created the R.31 reconnaissance aircraft?'
response = generate_response(question)
print(response)

The R.31 reconnaissance aircraft was created by Renard, with six built by Renard and the remainder by SABCA.


In [12]:
question = 'What guns were mounted on the Renard R.31?'
response = generate_response(question)
print(response)

The Renard R.31 was mounted with one or two forward-firing 7.62 mm Vickers machine guns and one 7.62 mm Lewis machine gun in a flexible mount in the rear cockpit.


In [13]:
question = 'Who was the first softball player to represent any country at four World Series of Softball?'
response = generate_response(question)
print(response)

Majorie Nelson was the first softball player to represent any country at four World Series of Softball.


In [14]:
question = 'Who were the pitchers on the Australian softball team\'s roster at the 2020 Summer Olympics?'
response = generate_response(question)
print(response)

The pitchers on the Australian softball team's roster at the 2020 Summer Olympics were:

- Ellen Roberts
- Tarni Stepto
- Kaia Parnaby
- Gabbie Plain


### Discussion of scaling challenges

This is about as simple as a RAG system gets, and it will probably be suboptimal as the number of documents grows. Here are some challenges that we might run into and how we could address them:

1. **Information recall.** As the number of documents increases, the recall of the system will decrease. It will become more likely that a simple dot-product vector search will not return the documents needed to answer a user's query. Some of the following strategies may help:
    - **Better chunking and preprocessing.** In this exercise, I simply extracted the entire text of each PDF and split it into chunks. This worked well for two documents, but it may not be sufficient for a larger knowledge base. We could use a more sophisticated method of chunking, such as splitting at meaningful sentence boundaries. We could also use an LLM to reformat the text into cleaner, more distinct blocks of text. We might also want to attach more metadata to the chunks, such as the heading name or the page number of the excerpt in the document (similar to how we added the document title to each chunk).
    - **Reranking.** One successful approach to improving retrieval pipelines is to incorporate a *reranker* model. In this setup, we first retrieve a large number of candidate chunks using vector search, and then use a more expensive model to directly compare each candidate chunk to the query, resulting in a higher quality set of final documents than would be returned by vector search alone.
    - **Hybrid vector/keyword search.** Another common approach is to combine vector search with a more traditional keyword search. This hybrid approach is often used in production systems, and it can be more successful than either approach in isolation.

2. **Model limitations.** GPT-4o is a powerful model, but it may not be practical for every use case because of its API costs, latency, and lack of on-premises deployment. An open-access model deployed locally could address this, though possibly at the cost of lower response quality. We might then need to spend additional time tuning the prompt and the retrieval pipeline to get GPT-4o-quality answers.

3. **Incomplete or conflicting information.** For the questions provided in this exercise, the answer was clearly stated in one of the documents. In a real-world scenario, we would likely run into situations where the answer is not in the documents, or where the documents contain conflicting or outdated information. Careful prompt engineering and extensive testing would be needed to ensure that model hallucinations or outdated documents do not mislead users.
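
To make the hybrid vector/keyword idea from item 1 a little more concrete, here is a minimal sketch that merges two rankings with reciprocal rank fusion (RRF), one common way to combine them. Everything in it is an assumption for illustration: `keyword_rank` is a crude term-overlap stand-in for a real keyword scorer such as BM25, `vector_ranking` is a made-up result rather than output from `retrieve_chunks`, the toy texts are not the actual chunks, and `k=60` is just a commonly used default.

```python
from collections import defaultdict

def keyword_rank(query, texts):
    """Rank text indices by number of shared lowercase terms (crude BM25 stand-in)."""
    query_terms = set(query.lower().split())
    scores = [len(query_terms & set(t.lower().split())) for t in texts]
    # sorted() is stable, so ties keep their original order.
    return sorted(range(len(texts)), key=lambda i: -scores[i])

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of ids; each id scores the sum of 1 / (k + rank)."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, idx in enumerate(ranking, start=1):
            fused[idx] += 1.0 / (k + rank)
    return sorted(fused, key=lambda i: -fused[i])

# Toy texts for illustration only.
texts = [
    'The Renard R.31 was a Belgian reconnaissance aircraft.',
    'Softball Australia governs the national softball team.',
    'SABCA built most of the R.31 aircraft.',
]
vector_ranking = [0, 1, 2]  # hypothetical output of a vector search
keyword_ranking = keyword_rank('who built the R.31 aircraft', texts)
fused_ranking = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
print(keyword_ranking, fused_ranking)
```

RRF only needs ranks, not comparable scores, which is why it is a convenient way to merge a cosine-similarity ranking with a keyword ranking whose scores live on a different scale.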