# Question Answering using Embeddings

based on: https://github.com/openai/openai-cookbook/blob/main/examples/Question_answering_using_embeddings.ipynb

In this notebook, we experiment with asking GPT-3 to answer questions using a library of podcast transcripts as a reference. We achieve this by using document embeddings and retrieval as an intermediate step in the question-answer process.

In [1]:
# installs
%pip install openai

Note: you may need to restart the kernel to use updated packages.


In [4]:
import pandas as pd
import openai
import numpy as np
import pickle
from transformers import GPT2TokenizerFast

COMPLETIONS_MODEL = "text-davinci-002"

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


In [5]:
# note that I use a config file to store my API keys.
import config

openai.api_key = config.openai_apikey
# or, use openai.api_key = 'your-api-key'

## Motivating Example

From the sample notebook: `By default, GPT-3 isn't an expert on the 2020 Olympics:`

In [4]:
prompt = "Who won the 2020 Summer Olympics men's high jump?"

openai.Completion.create(
    prompt=prompt,
    temperature=0,
    max_tokens=300,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
    model=COMPLETIONS_MODEL
)["choices"][0]["text"].strip(" \n")

"The 2020 Summer Olympics men's high jump was won by Mariusz Przybylski of Poland."

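Now, let's try a business question I'd like to answer: How does ScaleAI make money? This prompt already includes the "say you don't know" instruction that the sample notebook introduces next.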

In [6]:
prompt = """Answer the question as truthfully as possible, and if you're unsure of the answer, say "Sorry, I don't know".

Q: How does ScaleAI make money?
A:"""

openai.Completion.create(
    prompt=prompt,
    temperature=0,
    max_tokens=300,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
    model=COMPLETIONS_MODEL
)["choices"][0]["text"].strip(" \n")

"Sorry, I don't know."

From the sample notebook: `Mariusz Przybylski is a professional footballer from Poland, and not much of a high jumper! Evidently GPT-3 needs some assistance here.`

`The first issue to tackle is that the model is hallucinating an answer rather than telling us "I don't know". This is bad because it makes it hard to trust the answer that the model gives us!`

In [5]:
prompt = """Answer the question as truthfully as possible, and if you're unsure of the answer, say "Sorry, I don't know".

Q: Who won the 2020 Summer Olympics men's high jump?
A:"""

openai.Completion.create(
    prompt=prompt,
    temperature=0,
    max_tokens=300,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
    model=COMPLETIONS_MODEL
)["choices"][0]["text"].strip(" \n")

"Sorry, I don't know."

From the sample notebook: `To help the model answer the question, we provide extra contextual information in the prompt. When the total required context is short, we can include it in the prompt directly. For example we can use this information taken from Wikipedia. We update the initial prompt to tell the model to explicitly make use of the provided text.`

In [6]:
prompt = """Answer the question as truthfully as possible using the provided text, and if the answer is not contained within the text below, say "I don't know"

Context:
The men's high jump event at the 2020 Summer Olympics took place between 30 July and 1 August 2021 at the Olympic Stadium.
33 athletes from 24 nations competed; the total possible number depended on how many nations would use universality places 
to enter athletes in addition to the 32 qualifying through mark or ranking (no universality places were used in 2021).
Italian athlete Gianmarco Tamberi along with Qatari athlete Mutaz Essa Barshim emerged as joint winners of the event following
a tie between both of them as they cleared 2.37m. Both Tamberi and Barshim agreed to share the gold medal in a rare instance
where the athletes of different nations had agreed to share the same medal in the history of Olympics. 
Barshim in particular was heard to ask a competition official "Can we have two golds?" in response to being offered a 
'jump off'. Maksim Nedasekau of Belarus took bronze. The medals were the first ever in the men's high jump for Italy and 
Belarus, the first gold in the men's high jump for Italy and Qatar, and the third consecutive medal in the men's high jump
for Qatar (all by Barshim). Barshim became only the second man to earn three medals in high jump, joining Patrik Sjöberg
of Sweden (1984 to 1992).

Q: Who won the 2020 Summer Olympics men's high jump?
A:"""

openai.Completion.create(
    prompt=prompt,
    temperature=0,
    max_tokens=300,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
    model=COMPLETIONS_MODEL
)["choices"][0]["text"].strip(" \n")

"Gianmarco Tamberi and Mutaz Essa Barshim won the 2020 Summer Olympics men's high jump."

## Processing the podcast data

To replicate the structure of the above (where we help the model "zoom in" on a specific body of text from which to pull the answer), we are going to use the following process:
- compare the embedding vector of the query (e.g. "what is the metaverse?") to the embedding vector of each podcast section and compute their similarity (dot product or cosine similarity)
- select the highest-ranked section or sections, subject to a character/token limit
- pass the selected section text along with the query to help the model answer the question
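
The three steps above can be sketched end to end on toy data. This is only an illustration: the 3-dimensional vectors, the section texts, and the helper name `retrieve_and_build_prompt` are made up for the example, and real embeddings would come from the API rather than being typed in by hand.

```python
# Toy illustration of the three steps: score, select, append.
# All vectors and texts below are invented for the example.

def dot(x: list[float], y: list[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

# hypothetical section texts and their made-up embedding vectors
sections = {
    "metaverse": "The metaverse is a live 3D version of the internet.",
    "buyouts": "Software buyouts focus on market leaders.",
}
section_vectors = {
    "metaverse": [0.9, 0.1, 0.0],
    "buyouts": [0.1, 0.9, 0.2],
}

def retrieve_and_build_prompt(query: str, query_vector: list[float], max_chars: int) -> str:
    # step 1: score every section against the query, best first
    ranked = sorted(
        section_vectors,
        key=lambda k: dot(query_vector, section_vectors[k]),
        reverse=True,
    )
    # step 2: keep the highest-ranked sections that fit in the character budget
    chosen, used = [], 0
    for key in ranked:
        text = sections[key]
        if used + len(text) > max_chars:
            break
        chosen.append(text)
        used += len(text)
    # step 3: pass the selected text along with the query
    return "Context:\n" + "\n".join(chosen) + "\n\nQ: " + query + "\nA:"

prompt = retrieve_and_build_prompt("What is the metaverse?", [1.0, 0.0, 0.0], max_chars=60)
print(prompt)
```

With a budget of 60 characters only the best-matching section fits, so the second section is left out of the prompt.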

In [6]:
# read in the data
df = pd.read_csv('preprocessed.csv')
df = df.set_index(["title", "heading"])
df = df.drop('Unnamed: 0', axis=1)
print(f"{len(df)} rows in the data.")
df.sample(5)

356 rows in the data.


| title | heading | content | tokens |
|---|---|---|---|
| Martin Casado - The Past, Present, and Future of Digital Infrastructure | Exciting Possibilities Offered By Digital Infrastructure | Patrick: And as you think about the ways that... | 1278 |
| John Pfeffer - Adapt and Evolve | Sources of Capital Efficiency and Value Creation | Patrick: Why would that be good? Thinking abo... | 2877 |
| Eric Mandelblatt - Investing in the Industrial Economy | How a Commodities Business Operates | Patrick: All right, so maybe we could shift n... | 2662 |
| Dmitry Balyasny - Building a Better Model | Outperforming with Multiple Investing Groups and Strategies | Patrick: Maybe you can just describe in some ... | 2489 |
| Kenneth Stanley - Greatness Without Goals | The Story of Picbreeder | Patrick: In the book and in the presentation ... | 3929 |


Models used:
- `text-search-curie-query-001`: for embedding the search query
- `text-search-curie-doc-001`: for embedding the documents to be retrieved (constructing a 4096-dimensional embedding vector for each section)

More information: `https://beta.api.openai.org/docs/guides/embeddings/what-are-embeddings`

In [7]:
MODEL_NAME = "curie"

DOC_EMBEDDINGS_MODEL = f"text-search-{MODEL_NAME}-doc-001"
QUERY_EMBEDDINGS_MODEL = f"text-search-{MODEL_NAME}-query-001"

In [8]:
# functions as defined in the sample notebook

def get_embedding(text: str, model: str) -> list[float]:
    result = openai.Embedding.create(
      model=model,
      input=text
    )
    return result["data"][0]["embedding"]

def get_doc_embedding(text: str) -> list[float]:
    return get_embedding(text, DOC_EMBEDDINGS_MODEL)

def get_query_embedding(text: str) -> list[float]:
    return get_embedding(text, QUERY_EMBEDDINGS_MODEL)

def compute_doc_embeddings(df: pd.DataFrame) -> dict[tuple[str, str], list[float]]:
    """
    Create an embedding for each row in the dataframe using the OpenAI Embeddings API.
    
    Return a dictionary that maps between each embedding vector and the index of the row that it corresponds to.
    """

    return {
        idx: get_doc_embedding(r.content.replace("\n", " ")) for idx, r in df.iterrows()
    }


We truncate each podcast section to roughly 2,000 tokens, approximated here as 8,000 characters (about 4 characters per token). If I were to repeat this exercise, I might consider breaking the podcasts into more granular chunks (such as each interviewer question + the guest's answer). For now, the truncation seems to work okay.
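
The more granular chunking mentioned above could look something like the sketch below. It was not part of this analysis. It assumes every interviewer turn starts with the literal prefix `Patrick:` (as in the sample rows), and the function name `split_by_interviewer_turn` is hypothetical.

```python
import re

def split_by_interviewer_turn(transcript: str, interviewer: str = "Patrick") -> list[str]:
    """
    Split a transcript into chunks that each start with an interviewer turn
    and run until the next interviewer turn.

    ASSUMPTION: every interviewer turn starts with "<interviewer>:".
    Any text before the first interviewer turn becomes its own chunk.
    """
    marker = re.escape(interviewer) + ":"
    # zero-width split: break right before each marker so the marker stays in its chunk
    pieces = re.split(r"(?=\b" + marker + ")", transcript)
    return [p.strip() for p in pieces if p.strip()]

toy_transcript = "Patrick: What is X? Gabe: It is Y. Patrick: And Z? Gabe: Z is W."
chunks = split_by_interviewer_turn(toy_transcript)
for chunk in chunks:
    print(chunk)
```

Each chunk then holds one question plus the guest's answer, which would be embedded and retrieved on its own instead of as part of a truncated section.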

In [9]:
# function to truncate each section (text string) up to a configurable number of tokens

def truncate_by_token_max(input_str: str, max_tokens: int) -> str:
    # approximate the token limit as 4 characters per token
    max_chars = max_tokens * 4
    # slicing leaves strings shorter than max_chars unchanged
    return input_str[:max_chars]

In [10]:
df['content'] = df['content'].apply(lambda x: truncate_by_token_max(x, 2000))
df[:10]

| title | heading | content | tokens |
|---|---|---|---|
| Gabriel Leydon - How Web3 Onboards a Billion Users | Introduction | Patrick: My guest today is Gabe Leydon, whose... | 183 |
| Gabriel Leydon - How Web3 Onboards a Billion Users | Free-to-Own Gaming | Patrick: All right, Gabe, so it's been almost... | 2536 |
| Gabriel Leydon - How Web3 Onboards a Billion Users | Three Waves of NFTs | Patrick: Can you say a little bit about this ... | 1517 |
| Gabriel Leydon - How Web3 Onboards a Billion Users | DigiDaigaku | Patrick: Can you tell the story then of Digi,... | 2411 |
| Gabriel Leydon - How Web3 Onboards a Billion Users | NFTs Ability to Change Marketing | Patrick: One of the things that jumps out of ... | 4623 |
| Gabriel Leydon - How Web3 Onboards a Billion Users | AI, Innovation, & The Future | Patrick: It's exciting that some emotive resp... | 3464 |
| Harley Finkelstein - Building the Entrepreneurship Company | Introduction | Patrick: My guest today is Harley Finkelstein... | 110 |
| Harley Finkelstein - Building the Entrepreneurship Company | Finding Your Life's Work | Patrick: So Harley, maybe the place to begin ... | 3006 |
| Harley Finkelstein - Building the Entrepreneurship Company | The Entrepreneurial Formula | Patrick: If you think about your just hands-o... | 3096 |
| Harley Finkelstein - Building the Entrepreneurship Company | Delivering and Applying Good Advice in Business | Patrick: One of the things I'm obviously obse... | 1833 |


## Creating the Document Embeddings

NOTE: I needed to sign up for an API key in order to run the below code for the full dataset - it cost approximately $6 (see pricing: `https://openai.com/api/pricing/`).

If you are replicating this analysis with a free account, the included backoff (process a chunk of sections, e.g. 30, then sleep for 60 seconds) should keep you within the free tier rate limits if you are processing a smaller dataset.
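
An alternative to a fixed pause is to retry a failed call with exponentially growing waits. The sketch below is not what this notebook does, and `call_with_backoff` is a hypothetical helper. It catches every exception for simplicity, whereas real code would more likely catch only the client's rate-limit error.

```python
import time

def call_with_backoff(func, max_retries: int = 5, base_delay: float = 1.0, sleep=time.sleep):
    """
    Call func(); if it raises, wait and retry, doubling the wait each time.
    Re-raises the last exception once max_retries attempts have failed.
    The sleep argument is injectable so the example can run without waiting.
    """
    for attempt in range(max_retries):
        try:
            return func()
        except Exception:
            if attempt == max_retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)

# toy stand-in for an API call that fails twice before succeeding
attempts = {"n": 0}

def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"

delays = []  # record the requested waits instead of actually sleeping
result = call_with_backoff(flaky, sleep=delays.append)
print(result, delays)
```

In the notebook this would wrap a single embedding request, for example `call_with_backoff(lambda: get_doc_embedding(text))`.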

To avoid computing the doc embeddings yourself, simply skip the below code block. (I saved the result in `context_embeddings.pickle`)

In [12]:
import time

# script to chunk the dataframe being passed to the compute_doc_embeddings() method
def rate_limit(input_df: pd.DataFrame, limit: int):

    # chunk the dataframe into a list of dfs
    print('chunk size: ' + str(limit))
    df_list = []
    if input_df.shape[0] > limit: 

        for start in range(0, len(input_df), limit):
            df_list.append(input_df[start:start+limit])

    else:
        df_list.append(input_df)

    # make a list to collect the output of the compute_doc_embeddings() method
    output_list = []
    counter = 1
    for chunk_df in df_list:

        # compute the doc embeddings
        print('computing doc embeddings for chunk {} of {}'.format(str(counter), str(len(df_list))))
        output_list.append(compute_doc_embeddings(chunk_df))

        print('sleeping 60 seconds')
        time.sleep(60)
        counter = counter + 1

    # merge the list of dictionaries
    out_dict = {}
    for result_dict in output_list:
        out_dict.update(result_dict)
    
    return out_dict



# run the above method for 100 at a time (any config should work if you have an API key and have attached a credit card)
context_embeddings = rate_limit(df, 100)

import pickle

# save dictionary to pickle file
with open('context_embeddings.pickle', 'wb') as file:
    pickle.dump(context_embeddings, file, protocol=pickle.HIGHEST_PROTOCOL)


chunk size: 100
computing doc embeddings for chunk 1 of 4
sleeping 60 seconds
computing doc embeddings for chunk 2 of 4
sleeping 60 seconds
computing doc embeddings for chunk 3 of 4
sleeping 60 seconds
computing doc embeddings for chunk 4 of 4
sleeping 60 seconds


In [11]:
# load a pickle file
with open("context_embeddings.pickle", "rb") as file:
    document_embeddings = pickle.load(file)
# display the dictionary
print(list(document_embeddings.keys())[:5])


[('Gabriel Leydon - How Web3 Onboards a Billion Users', 'Introduction'), ('Gabriel Leydon - How Web3 Onboards a Billion Users', 'Free-to-Own Gaming'), ('Gabriel Leydon - How Web3 Onboards a Billion Users', 'Three Waves of NFTs'), ('Gabriel Leydon - How Web3 Onboards a Billion Users', 'DigiDaigaku'), ('Gabriel Leydon - How Web3 Onboards a Billion Users', 'NFTs Ability to Change Marketing')]
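
After loading a saved pickle it can be worth checking that it still matches the dataframe. A minimal sketch, using a hypothetical helper `check_embeddings` and toy data in place of `document_embeddings` and `df.index`:

```python
def check_embeddings(embeddings: dict, expected_keys) -> dict:
    """
    Compare the keys of an embeddings dictionary with the keys we expect
    (for example, the dataframe index) and collect the vector lengths seen.
    """
    expected = set(expected_keys)
    found = set(embeddings)
    return {
        "missing": sorted(expected - found),     # rows without an embedding
        "unexpected": sorted(found - expected),  # embeddings without a row
        "dimensions": sorted({len(v) for v in embeddings.values()}),
    }

# toy data standing in for document_embeddings and df.index
toy_embeddings = {
    ("Episode A", "Introduction"): [0.1, 0.2],
    ("Episode A", "Part 1"): [0.3, 0.4],
}
toy_index = [
    ("Episode A", "Introduction"),
    ("Episode A", "Part 1"),
    ("Episode B", "Introduction"),
]
report = check_embeddings(toy_embeddings, toy_index)
print(report)
```

On the real data the call would be `check_embeddings(document_embeddings, df.index)`, and a single entry of 4096 under `dimensions` would be the expected result for the curie document model.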


In [12]:
# functions borrowed from the example notebook

def vector_similarity(x: list[float], y: list[float]) -> float:
    """
    We could use cosine similarity or dot product to calculate the similarity between vectors.
    In practice, we have found it makes little difference. 
    """
    return np.dot(np.array(x), np.array(y))

def order_document_sections_by_query_similarity(query: str, contexts: dict[tuple[str, str], list[float]]) -> list[tuple[float, tuple[str, str]]]:
    """
    Find the query embedding for the supplied query, and compare it against all of the pre-calculated document embeddings
    to find the most relevant sections. 
    
    Return the list of document sections, sorted by relevance in descending order.
    """
    query_embedding = get_query_embedding(query)
    
    document_similarities = sorted([
        (vector_similarity(query_embedding, doc_embedding), doc_index) for doc_index, doc_embedding in contexts.items()
    ], reverse=True)
    
    return document_similarities
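
One likely reason the choice between dot product and cosine similarity "makes little difference" is that the two agree whenever both vectors have length 1. The sketch below shows this on toy vectors. Whether the API's embeddings are unit length is an assumption worth checking on your own data.

```python
import math

def dot(x: list[float], y: list[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def cosine_similarity(x: list[float], y: list[float]) -> float:
    """Dot product divided by the product of the two vector lengths."""
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))

def normalize(x: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(x, x))
    return [a / norm for a in x]

raw_a, raw_b = [3.0, 4.0], [1.0, 2.0]
a, b = normalize(raw_a), normalize(raw_b)

# on the raw vectors the two measures differ
print(dot(raw_a, raw_b), cosine_similarity(raw_a, raw_b))
# on the unit-length vectors they agree, up to floating-point rounding
print(dot(a, b), cosine_similarity(a, b))
```

For unit-length vectors the ranking produced by `vector_similarity` is therefore the same as a cosine-similarity ranking.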

## Testing the Similarity Scoring of Query vs Embeddings

Let's evaluate the "most similar" documents (podcast sections) relative to the query "What is the Metaverse?".

I chose the above query because I am expecting the model to return a section of the Matthew Ball episode in which he provides a good definition of the Metaverse.

As expected, the podcast titled 'Matthew Ball - A Manual to The Metaverse' is ranked 1st, 2nd, and 4th. I didn't check the sections specifically but at a high level, the similarity scoring seems to be working well.

In [13]:
order_document_sections_by_query_similarity("What is the Metaverse?", document_embeddings)[:5]

[(0.42707484151104347,
  ('Matthew Ball - A Manual to The Metaverse',
   'Familiar Platforms with Metaverse Elements')),
 (0.42434117795735227,
  ('Matthew Ball - A Manual to The Metaverse', 'Introduction')),
 (0.4020909055483938,
  ('Bill Gurley, Philip Rosedale - Back to the Future',
   'Defining The "Metaverse"')),
 (0.396451047957024,
  ('Matthew Ball - A Manual to The Metaverse',
   'Potential Future Winners and Losers')),
 (0.3925489225813553,
  ('Bill Gurley, Philip Rosedale - Back to the Future',
   'What the Metaverse is Missing'))]

In [14]:
MAX_SECTION_LEN = 5000
SEPARATOR = "\n* "

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
separator_len = len(tokenizer.tokenize(SEPARATOR))

f"Context separator contains {separator_len} tokens"

'Context separator contains 3 tokens'

## Constructing Prompts

Now, we can construct a prompt according to the following structure:
- anti-hallucination block ("answer as truthfully as possible")
- the context (as many of the most relevant podcast sections as fit within the maximum section length)
- the query
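
The "as many sections as we can fit" rule for the context can be sketched on its own, without the API. The helper name `select_within_budget` and the token counts are made up for the example. The loop mirrors a simple greedy rule: add sections in ranked order and stop at the first one that overflows the budget.

```python
def select_within_budget(
    ranked_keys: list[str],
    token_counts: dict[str, int],
    max_tokens: int,
    separator_tokens: int,
) -> list[str]:
    """Greedily keep ranked sections until the next one would exceed max_tokens."""
    chosen, used = [], 0
    for key in ranked_keys:
        used += token_counts[key] + separator_tokens
        if used > max_tokens:
            break  # stop at the first section that does not fit
        chosen.append(key)
    return chosen

# hypothetical token counts, listed in ranked order a, b, c, d
toy_counts = {"a": 2000, "b": 2500, "c": 900, "d": 100}
selected = select_within_budget(["a", "b", "c", "d"], toy_counts, max_tokens=5000, separator_tokens=3)
print(selected)
```

Section `d` would have fit, but it is never considered because the loop stops at `c`. Skipping oversized sections instead of stopping is a possible variation, at the cost of including lower-ranked material.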

In [15]:
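def construct_prompt(question: str, context_embeddings: dict, df: pd.DataFrame) -> str:
    """
    Fetch the document sections most relevant to the question and assemble them,
    together with the instruction header and the question, into a single prompt.
    """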
    most_relevant_document_sections = order_document_sections_by_query_similarity(question, context_embeddings)
    
    chosen_sections = []
    chosen_sections_len = 0
    chosen_sections_indexes = []
     
    for _, section_index in most_relevant_document_sections:
        # Add contexts until we run out of space.        
        document_section = df.loc[section_index]
        
        chosen_sections_len += document_section.tokens + separator_len
        if chosen_sections_len > MAX_SECTION_LEN:
            break
            
        chosen_sections.append(SEPARATOR + document_section.content.replace("\n", " "))
        chosen_sections_indexes.append(str(section_index))
            
    # Useful diagnostic information
    print(f"Selected {len(chosen_sections)} document sections:")
    print("\n".join(chosen_sections_indexes))
    
    header = """Answer the question as truthfully as possible using the provided context, and if the answer is not contained within the text below, say "I don't know."\n\nContext:\n"""
    
    return header + "".join(chosen_sections) + "\n\n Q: " + question + "\n A:"

In [16]:
prompt = construct_prompt(
    "What is the Metaverse?",
    document_embeddings,
    df
)

print("===\n", prompt)

Selected 2 document sections:
('Matthew Ball - A Manual to The Metaverse', 'Familiar Platforms with Metaverse Elements')
('Matthew Ball - A Manual to The Metaverse', 'Introduction')
===
 Answer the question as truthfully as possible using the provided context, and if the answer is not contained within the text below, say "I don't know."

Context:

*  Patrick: Well, round two is here. We get to talk about one of the most interesting topics in the world, especially because of how much insane detail there is under the topic of the metaverse, much of it contained in your awesome new book that I just finished a couple days ago. I think a fun place to begin picking up on our last conversation is with a couple analogy questions, specifically around things that already exist that people will be familiar with, and the degree to which you think they represent something like the metaverse. I want to start here as an anchor point, rather than go deep into the infrastructure right from the beginnin

In [17]:
COMPLETIONS_API_PARAMS = {
    # We use temperature of 0.0 because it gives the most predictable, factual answer.
    "temperature": 0.0,
    "max_tokens": 300,
    "model": COMPLETIONS_MODEL,
}
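
The prompt and the completion share the model's context window, so the prompt length plus `max_tokens` has to stay within it. A small arithmetic sketch follows. The 4,000-token window is an assumed round figure for illustration only, and `remaining_context` is a hypothetical helper. Check the model documentation for the real limit.

```python
def remaining_context(prompt_tokens: int, max_completion_tokens: int, context_window: int) -> int:
    """Tokens left over after prompt and completion; negative means the request would not fit."""
    return context_window - prompt_tokens - max_completion_tokens

# ASSUMPTION: a context window of about 4,000 tokens for the completions model
ASSUMED_CONTEXT_WINDOW = 4000

fits = remaining_context(3500, 300, ASSUMED_CONTEXT_WINDOW)
overflows = remaining_context(5000, 300, ASSUMED_CONTEXT_WINDOW)
print(fits, overflows)
```

A negative result suggests either lowering the section budget or reducing `max_tokens`.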

In [18]:
def answer_query_with_context(
    query: str,
    df: pd.DataFrame,
    document_embeddings: dict[tuple[str, str], list[float]],
    show_prompt: bool = False
) -> str:
    prompt = construct_prompt(
        query,
        document_embeddings,
        df
    )
    
    if show_prompt:
        print(prompt)

    response = openai.Completion.create(
                prompt=prompt,
                **COMPLETIONS_API_PARAMS
            )

    return response["choices"][0]["text"].strip(" \n")

## Questions and Answers

Finally, let's test out the question-answer system.

In [30]:
# pretty decent answer

answer_query_with_context("What is the Metaverse?", df, document_embeddings)

Selected 2 document sections:
('Matthew Ball - A Manual to The Metaverse', 'Familiar Platforms with Metaverse Elements')
('Matthew Ball - A Manual to The Metaverse', 'Introduction')


'A live 3D version of the internet as we know today is the best and simplest way to think about this.'

In [31]:
# another decent answer

answer_query_with_context("How does ScaleAI make money?", df, document_embeddings)

Selected 2 document sections:
('Alexandr Wang - A Primer on AI', 'Building the AWS of the Future')
('Alexandr Wang - A Primer on AI', 'Introduction')


'ScaleAI makes money by providing data solutions to leading AI teams. These solutions help the teams to produce high quality data, which is essential for AI models.'

In [32]:
# I was trying to probe for something like: "ScaleAI makes money by providing data labeling services to AI companies" - possibly need to do a bit more work to get there.

# still not bad, though.

answer_query_with_context("What product does ScaleAI sell?", df, document_embeddings)

Selected 2 document sections:
('Alexandr Wang - A Primer on AI', 'Building the AWS of the Future')
('Alexandr Wang - A Primer on AI', 'Introduction')


'ScaleAI sells a product that helps companies produce high quality data sets for their machine learning algorithms.'

In [20]:
# a different topic: a firm's investment strategy

answer_query_with_context("What is Thoma Bravo's investment strategy?", df, document_embeddings)

Selected 2 document sections:
('Orlando Bravo - The Art of Software Buyouts', 'Introduction')
('Orlando Bravo - The Art of Software Buyouts', 'Opportunity Set vs. Capital Flows')


"Thoma Bravo invests in software and technology businesses. It was Orlando who led the firm's early entry into software buyouts some 20 years ago, and he has overseen more than 350 software acquisitions since."

In [21]:
# same question, but asked about Orlando Bravo himself rather than the firm; this time 3 sections are selected

answer_query_with_context("What is Orlando Bravo's investment strategy?", df, document_embeddings)

Selected 3 document sections:
('Orlando Bravo - The Art of Software Buyouts', 'Introduction')
('Orlando Bravo - The Art of Software Buyouts', 'Opportunity Set vs. Capital Flows')
('Orlando Bravo - The Art of Software Buyouts', 'Private Equity & Future Return Potential')


"Orlando Bravo's investment strategy is to buy the market leaders of today in the software industry."

In [23]:
# # freeze a requirements file for the project
# %pip freeze > requirements.txt

