# Question answering using embeddings-based search

This notebook demonstrates a two-step Search-Ask method for enabling GPT to answer questions using a library of reference text.

1. **Search:** search your library of text for relevant text sections
2. **Ask:** insert the retrieved text sections into a message to GPT and ask it the question

In [10]:
import ast  # for converting embeddings saved as strings back to arrays
import openai  # for generating embeddings
import os  # for environment variables
import pandas as pd  # for DataFrames to store article sections and embeddings
import tiktoken  # for counting tokens
import dotenv  # for loading environment variables
from scipy import spatial  # for calculating vector similarities for search
import pymupdf # for parsing PDFs

dotenv.load_dotenv(".env")

client = openai.OpenAI()

In [11]:
# Variables section
DOCUMENTS_FOLDER = os.getcwd() + "/data/raw" # the folder wit rqw PDFs
GPT_MODEL = "gpt-4o-mini"  # only matters insofar as it selects which tokenizer to use
EMBEDDING_MODEL = "text-embedding-3-small" # the model to create embeddings (have to be the same during database preparation and asking)
MAX_TOKENS = 800 # the number of tokens allowed for one string in tokenizer
BATCH_SIZE = 500  # you can submit up to 2048 embedding inputs per request
EMBEDDING_PATH = os.getcwd() + "/data/sloboda_knowledge_base.csv"

# 1. Embedding Sloboda-studio articles for search

This part shows how we prepared a dataset of internal articles for search

Procedure:

1. Collect: We parse texts from PDFs
2. Embed: Each section is embedded with the OpenAI API
3. Store: Embeddings are saved in a CSV file (for large datasets, use a vector database)

## 1.1. Collect documents

We parse texts from PDF

In [12]:
# open data/raw directory and iterate over docs
docs = []

for filename in os.listdir(DOCUMENTS_FOLDER):
    print("-- filename", filename)
    
    text = ""
    with pymupdf.open(os.path.join(DOCUMENTS_FOLDER, filename)) as doc:
        for page in doc: # iterate the document pages
            text += page.get_text()
            
    docs.append(([[filename], text]))
            
# print("docs: ", docs)

-- filename Certification & Education reimbursement policy.pdf
-- filename Time tracking People Force .pdf
-- filename Onboarding Presentation 2024.pdf
-- filename Sunflower game.pdf
-- filename Vacation policy.pdf
-- filename Hardware provision policy.pdf


Next, we'll recursively split long sections into smaller sections.

There's no perfect recipe for splitting text into sections.

Some tradeoffs include:
- Longer sections may be better for questions that require more context
- Longer sections may be worse for retrieval, as they may have more topics muddled together
- Shorter sections are better for reducing costs (which are proportional to the number of tokens)
- Shorter sections allow more sections to be retrieved, which may help with recall
- Overlapping sections may help prevent answers from being cut by section boundaries

Here, we'll use a simple approach and limit sections to 1,600 tokens each, recursively halving any sections that are too long. To avoid cutting in the middle of useful sentences, we'll split along paragraph boundaries when possible.

In [13]:
def num_tokens(text: str, model: str = GPT_MODEL) -> int:
    """Return the number of tokens in a string."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))


def halved_by_delimiter(string: str, delimiter: str = "\n") -> list[str, str]:
    """Split a string in two, on a delimiter, trying to balance tokens on each side."""
    chunks = string.split(delimiter)
    if len(chunks) == 1:
        return [string, ""]  # no delimiter found
    elif len(chunks) == 2:
        return chunks  # no need to search for halfway point
    else:
        total_tokens = num_tokens(string)
        halfway = total_tokens // 2
        best_diff = halfway
        for i, chunk in enumerate(chunks):
            left = delimiter.join(chunks[: i + 1])
            left_tokens = num_tokens(left)
            diff = abs(halfway - left_tokens)
            if diff >= best_diff:
                break
            else:
                best_diff = diff
        left = delimiter.join(chunks[:i])
        right = delimiter.join(chunks[i:])
        return [left, right]


def truncated_string(
    string: str,
    model: str,
    max_tokens: int,
    print_warning: bool = True,
) -> str:
    """Truncate a string to a maximum number of tokens."""
    encoding = tiktoken.encoding_for_model(model)
    encoded_string = encoding.encode(string)
    truncated_string = encoding.decode(encoded_string[:max_tokens])
    if print_warning and len(encoded_string) > max_tokens:
        print(f"Warning: Truncated string from {len(encoded_string)} tokens to {max_tokens} tokens.")
    return truncated_string


def split_strings_from_subsection(
    subsection: tuple[list[str], str],
    max_tokens: int = 1000,
    model: str = GPT_MODEL,
    max_recursion: int = 5,
) -> list[str]:
    """
    Split a subsection into a list of subsections, each with no more than max_tokens.
    Each subsection is a tuple of parent titles [H1, H2, ...] and text (str).
    """
    titles, text = subsection
    string = "\n\n".join(titles + [text])
    num_tokens_in_string = num_tokens(string)
    # if length is fine, return string
    if num_tokens_in_string <= max_tokens:
        return [string]
    # if recursion hasn't found a split after X iterations, just truncate
    elif max_recursion == 0:
        return [truncated_string(string, model=model, max_tokens=max_tokens)]
    # otherwise, split in half and recurse
    else:
        titles, text = subsection
        for delimiter in ["\n\n", "\n", ". "]:
            left, right = halved_by_delimiter(text, delimiter=delimiter)
            if left == "" or right == "":
                # if either half is empty, retry with a more fine-grained delimiter
                continue
            else:
                # recurse on each half
                results = []
                for half in [left, right]:
                    half_subsection = (titles, half)
                    half_strings = split_strings_from_subsection(
                        half_subsection,
                        max_tokens=max_tokens,
                        model=model,
                        max_recursion=max_recursion - 1,
                    )
                    results.extend(half_strings)
                return results
    # otherwise no split was found, so just truncate (should be very rare)
    return [truncated_string(string, model=model, max_tokens=max_tokens)]

In [14]:
document_strings = []
for section in docs:
    document_strings.extend(split_strings_from_subsection(section, max_tokens=MAX_TOKENS))

print(f"{len(docs)} Document sections split into {len(document_strings)} strings.")

6 Document sections split into 16 strings.


In [15]:
document_strings

["Certification & Education reimbursement policy.pdf\n\nCertification & Education Reimbursement Policy\nAt Sloboda Studio, we're all about helping you grow professionally. That's why we've set\nup a simple policy to support you in attending external courses and conferences or\ngetting certifications that boost your skills.\nWho's Eligible?\nFull-time employees, with us for 6+ months.\nHow to Apply?\nJust chat with HR before starting your certification journey.\nMoney Matters\nFor Certification\n●\nPass the Certification: We cover the full cost (up to $1,000).\n●\nDon't Pass? No worries, we still cover half the cost within abovementioned limit.\nFor external Courses\n●\nWe cover half the cost (up to $1,000).\nFor Conferences\n●\nWe cover the full cost (up to $1,000) *\n* prior Direct Manager approval is needed\nRemember!\nAfter the completion / pass: Show your certificate to HR, and we'll handle the payment.\nEvery case is unique. We're here to discuss and assist further if you need it.

## 1.2. Embed document chunks

Now that we've split our library into shorter self-contained strings, we can compute embeddings for each.

In [19]:
embeddings = []
for batch_start in range(0, len(document_strings), BATCH_SIZE):
    batch_end = batch_start + BATCH_SIZE
    print(f"Batch {batch_start} to {batch_end-1}")
    batch = document_strings[batch_start:batch_end]
    print(batch)
    response = client.embeddings.create(model=EMBEDDING_MODEL, input=batch)
    for i, be in enumerate(response.data):
        assert i == be.index  # double check embeddings are in same order as input
    batch_embeddings = [e.embedding for e in response.data]
    embeddings.extend(batch_embeddings)

df = pd.DataFrame({"text": document_strings, "embedding": embeddings})

Batch 0 to 499
["Certification & Education reimbursement policy.pdf\n\nCertification & Education Reimbursement Policy\nAt Sloboda Studio, we're all about helping you grow professionally. That's why we've set\nup a simple policy to support you in attending external courses and conferences or\ngetting certifications that boost your skills.\nWho's Eligible?\nFull-time employees, with us for 6+ months.\nHow to Apply?\nJust chat with HR before starting your certification journey.\nMoney Matters\nFor Certification\n●\nPass the Certification: We cover the full cost (up to $1,000).\n●\nDon't Pass? No worries, we still cover half the cost within abovementioned limit.\nFor external Courses\n●\nWe cover half the cost (up to $1,000).\nFor Conferences\n●\nWe cover the full cost (up to $1,000) *\n* prior Direct Manager approval is needed\nRemember!\nAfter the completion / pass: Show your certificate to HR, and we'll handle the payment.\nEvery case is unique. We're here to discuss and assist further 

In [24]:
df

Unnamed: 0,text,embedding
0,Certification & Education reimbursement policy...,"[0.010979884304106236, 0.0063405423425138, 0.0..."
1,Time tracking People Force .pdf\n\nTime tracki...,"[-0.044048428535461426, 0.05170198529958725, -..."
2,Time tracking People Force .pdf\n\nспівробітни...,"[-0.03341351076960564, 0.04654759168624878, 0...."
3,Time tracking People Force .pdf\n\nHead of Sal...,"[-0.03960372135043144, 0.06344015151262283, 0...."
4,Time tracking People Force .pdf\n\nцього приво...,"[-0.04740346968173981, 0.0653124749660492, -0...."
5,Time tracking People Force .pdf\n\nз іншими не...,"[-0.047159258276224136, 0.06108080968260765, -..."
6,Time tracking People Force .pdf\n\nTasks створ...,"[-0.02253440022468567, 0.0616401806473732, -0...."
7,Time tracking People Force .pdf\n\nповідомляє ...,"[-0.03658929467201233, 0.05917266756296158, 0...."
8,Time tracking People Force .pdf\n\nHR необхідн...,"[-0.04666323587298393, 0.058744896203279495, 0..."
9,Onboarding Presentation 2024.pdf\n\nWelcome to...,"[-0.026788972318172455, 0.03693313151597977, -..."


## 1.3. Store document chunks and embeddings

We'll store them in a CSV file.

(For larger datasets, use a vector database, which will be more performant.)

In [21]:
# save document chunks and embeddings
df.to_csv(EMBEDDING_PATH, index=False)

# 2. Chat functionality

This is an independent part for searching/asking process, based on the indexed data, collected, processed and saved in the part 1 of tis notebook

Contents:
1. Get prepared data
2. Search
3. Ask

# 2.1 Get prepared search data

This data was prepared and saved in the section 1 of this notebook. 

In [22]:
df = pd.read_csv(EMBEDDING_PATH)

# convert embeddings from CSV str type back to list type
df['embedding'] = df['embedding'].apply(ast.literal_eval)

In [23]:
df

Unnamed: 0,text,embedding
0,Certification & Education reimbursement policy...,"[0.010979884304106236, 0.0063405423425138, 0.0..."
1,Time tracking People Force .pdf\n\nTime tracki...,"[-0.044048428535461426, 0.05170198529958725, -..."
2,Time tracking People Force .pdf\n\nспівробітни...,"[-0.03341351076960564, 0.04654759168624878, 0...."
3,Time tracking People Force .pdf\n\nHead of Sal...,"[-0.03960372135043144, 0.06344015151262283, 0...."
4,Time tracking People Force .pdf\n\nцього приво...,"[-0.04740346968173981, 0.0653124749660492, -0...."
5,Time tracking People Force .pdf\n\nз іншими не...,"[-0.047159258276224136, 0.06108080968260765, -..."
6,Time tracking People Force .pdf\n\nTasks створ...,"[-0.02253440022468567, 0.0616401806473732, -0...."
7,Time tracking People Force .pdf\n\nповідомляє ...,"[-0.03658929467201233, 0.05917266756296158, 0...."
8,Time tracking People Force .pdf\n\nHR необхідн...,"[-0.04666323587298393, 0.058744896203279495, 0..."
9,Onboarding Presentation 2024.pdf\n\nWelcome to...,"[-0.026788972318172455, 0.03693313151597977, -..."


## 2.2. Search

Now we'll define a search function that:
- Takes a user query and a dataframe with text & embedding columns
- Embeds the user query with the OpenAI API
- Uses distance between query embedding and text embeddings to rank the texts
- Returns two lists:
    - The top N texts, ranked by relevance
    - Their corresponding relevance scores

In [25]:
# search function
def strings_ranked_by_relatedness(
    query: str,
    df: pd.DataFrame,
    relatedness_fn=lambda x, y: 1 - spatial.distance.cosine(x, y),
    top_n: int = 100
) -> tuple[list[str], list[float]]:
    """Returns a list of strings and relatednesses, sorted from most related to least."""
    query_embedding_response = client.embeddings.create(
        model=EMBEDDING_MODEL,
        input=query,
    )
    query_embedding = query_embedding_response.data[0].embedding
    strings_and_relatednesses = [
        (row["text"], relatedness_fn(query_embedding, row["embedding"]))
        for i, row in df.iterrows()
    ]
    strings_and_relatednesses.sort(key=lambda x: x[1], reverse=True)
    strings, relatednesses = zip(*strings_and_relatednesses)
    return strings[:top_n], relatednesses[:top_n]

In [27]:
search_query = "What is the best way to get highest score in sunflower game?"

strings, relatednesses = strings_ranked_by_relatedness(search_query, df, top_n=5)
for string, relatedness in zip(strings, relatednesses):
    print(f"{relatedness=:.3f}")
    display(string)

relatedness=0.479


"Sunflower game.pdf\n\nSunflower game\nAt Sloboda Studio, we believe the foundation of success is the collective effort\nand creativity of our team. You're not just doing your job; you're fueling our\ngrowth by recommending us, sharing our vision, innovating for our future,\nand so much more.\nWe're immensely grateful for your contributions - and we think it's\nimportant that your efforts blossomed into rewards. Enter the Sunflower\nGame, a playful yet rewarding way to earn delightful perks.\nHow It Works\nAs you shine in your role and sprinkle a little extra magic, you'll gather\nSunflower Points (☀️) across various categories. These points can be\nexchanged for enticing rewards from our prize pool.\nEarning Sunflower Points\nHere’s how you can brighten your garden:\nCategories and options:\nCategory\nDetails\nPoints\nCompany Development\nIdea\nPropose new processes, tools for improved\ndepartment functionality, or innovative\nlanguages. Share your thoughts with the\nCEO, HR, or PM.\n

relatedness=0.223


"Onboarding Presentation 2024.pdf\n\nin \na \nrelaxed \nand \nenjoyable atmosphere.\n➔\nSloboda Challenges: Exciting monthly \nchallenges \nthat \nadd \na \ntouch \nof \nplayfulness to our routine, you can have \nfun and win a prize.\nEvents\nFree English lessons\nCorporate English lessons at 4 \nlevels (Pre-Intermediate, \nIntermediate, Upper-Intermediate, \nAdvanced)\nCertification coverage up \nto $1,000. \nExternal courses fund half \nof fees and the full cost of \nconferences (up to $1,000)\nYou will find certification \npolicy here.\nWith care for you\nReferral Program Bonuses for \nattracting your friends and \nex-colleagues to the company.\nWe offer access to therapy \nservices or can cover therapy \nexpenses by 50% or up to 500 UAH \nper session.\nCovering 50% of costs for \nwellness services, like massages, \nand medical check-ups, ensuring \nyou feel cared for and valued.\nIf you're getting married or having \na newborn child, give us a heads \nup, and we'll congratulate you

relatedness=0.119


"Onboarding Presentation 2024.pdf\n\nas Sloboda Talks, Tech Talks, Yoga Hour, etc. \n3.\nBreak Time Hub: Where we step away for a \nmoment. Share when you're on a break or \naway from your desk, and reconnect when \nyou're back!\n4.\nGeneral Announcements: A central space \nfor important updates and news relevant \nto all employees across the company.\n5.\nDevelopment Hub: This space is \ndedicated to sending inquiries related to \ntechnologies and development. \nPersonal signature in email\nEach specialist must have a single corporate signature when sending letters.\nTo do it, you need:\n1.\nTo visit https://www.hubspot.com/email-signature-generator\n2.\nIn the first section choose Template #3\n3.\nIn the second section, fill in the following fields: First Name, Last Name, Job Title, Company \nName - Sloboda Studio, Mobile Phone Number, Website URL https://sloboda-studio.com/, \nEmail Address, Custom Field Web Development with Care \nNB: The rest of the fields do not need to be filled

relatedness=0.107


"Vacation policy.pdf\n\nVacation and Sick Leave Policy\nLeave Accrual\nFor you, 18 days are accrued annually at a rate of 1.5 days per month.\nWith 3+ years at Sloboda, this increases to 19 days, and 20 days for 5+ years.\nUpon starting your probationary period, you're immediately entitled to 5 sick days per year, which\ncan be taken as needed. Sick days are accrued annually starting from the date of hire as an in-house\nemployee.\nUsage\nAccrual starts from day one, but leaves can be taken after 3 months. However, if we have already\ncollaborated on a freelance basis for three months prior, vacation is accrued immediately and\navailable the following month.\nCoordinate leave requests two weeks ahead via our portal\nExcess is paid in the next payroll.\nCaps and Payouts\nMaximum rollover is 27 days.\nMonitor accrued leave on PF and in your personal payroll document.\nPayouts are based on the average over the past 12 consecutive months.\nUnused leaves are paid upon termination, but not f

relatedness=0.101


"Certification & Education reimbursement policy.pdf\n\nCertification & Education Reimbursement Policy\nAt Sloboda Studio, we're all about helping you grow professionally. That's why we've set\nup a simple policy to support you in attending external courses and conferences or\ngetting certifications that boost your skills.\nWho's Eligible?\nFull-time employees, with us for 6+ months.\nHow to Apply?\nJust chat with HR before starting your certification journey.\nMoney Matters\nFor Certification\n●\nPass the Certification: We cover the full cost (up to $1,000).\n●\nDon't Pass? No worries, we still cover half the cost within abovementioned limit.\nFor external Courses\n●\nWe cover half the cost (up to $1,000).\nFor Conferences\n●\nWe cover the full cost (up to $1,000) *\n* prior Direct Manager approval is needed\nRemember!\nAfter the completion / pass: Show your certificate to HR, and we'll handle the payment.\nEvery case is unique. We're here to discuss and assist further if you need it. 

## 2.3. Ask

With the search function above, we can now automatically retrieve relevant knowledge and insert it into messages to GPT.

Below, we define a function `ask` that:
- Takes a user query
- Searches for text relevant to the query
- Stuffs that text into a message for GPT
- Sends the message to GPT
- Returns GPT's answer

In [33]:
def num_tokens(text: str, model: str = GPT_MODEL) -> int:
    """Return the number of tokens in a string."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))


def query_message(
    query: str,
    df: pd.DataFrame,
    model: str,
    token_budget: int
) -> str:
    """Return a message for GPT, with relevant source texts pulled from a dataframe."""
    strings, relatednesses = strings_ranked_by_relatedness(query, df)
    introduction = 'Use the below article to answer the subsequent question. If the answer cannot be found, write "I dont know..."'
    question = f"\n\nQuestion: {query}"
    message = introduction
    for string in strings:
        next_article = f'\n\nDoc section:\n"""\n{string}\n"""'
        if (
            num_tokens(message + next_article + question, model=model)
            > token_budget
        ):
            break
        else:
            message += next_article
    return message + question


def ask(
    query: str,
    df: pd.DataFrame = df,
    model: str = GPT_MODEL,
    token_budget: int = 4096 - 500,
    print_message: bool = False,
) -> str:
    """Answers a query using GPT and a dataframe of relevant texts and embeddings."""
    message = query_message(query, df, model=model, token_budget=token_budget)
    if print_message:
        print(message)
    messages = [
        {"role": "system", "content": "You answer questions about Sloboda-studio internal rules"},
        {"role": "user", "content": message},
    ]
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0
    )
        
    response_message = response.choices[0].message.content
    
    print("\n\n>>> Usage: ", response.usage)
    
    return response_message

In [38]:
ask(search_query, token_budget=1500, print_message=False)



>>> Usage:  CompletionUsage(completion_tokens=109, prompt_tokens=1212, total_tokens=1321, prompt_tokens_details={'cached_tokens': 1024}, completion_tokens_details={'reasoning_tokens': 0})


'The best way to get the highest score in the Sunflower Game is to participate in activities that earn the most Sunflower Points. The highest point-earning activity listed is "Public Speaking," which awards 50☀️ points for representing the company at conferences, meetups, or internal trainings. Other high-scoring activities include "Candidate Referral" (20☀️ points) and "Marketing Article Assistance" (20☀️ points). Engaging in these activities, along with consistently contributing in other categories, will maximize your score.'