# Question Answering using Embeddings

Many use cases require GPT-3 to respond to user questions with insightful answers. For example, a customer support chatbot may need to provide answers to common questions. The GPT models have picked up a lot of general knowledge in training, but we often need to ingest and use a large library of more specific information.

In this notebook we will demonstrate a method for enabling GPT-3 to answer questions using a library of text as a reference, by using document embeddings and retrieval. We'll be using a dataset of Wikipedia articles about the 2022 FIFA World Cup.

In [1]:
import pandas as pd
import openai
import numpy as np
import pickle
from transformers import GPT2TokenizerFast
import json

COMPLETIONS_MODEL = "text-davinci-003"



In [3]:
# Azure OpenAI 
# Insert your API endpoint URL & key
openai.api_type = "azure"
openai.api_version = "2022-12-01"
openai.api_base = ""
openai.api_key = ""

By default, GPT-3 isn't an expert on the 2022 FIFA World Cup:

In [4]:
prompt = "Who won the 2022 Fifa World Cup?"

openai.Completion.create(
    model=COMPLETIONS_MODEL, 
    prompt=prompt,
    temperature=0,
    max_tokens=300,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
)["choices"][0]["text"].strip(" \n")

'The 2022 FIFA World Cup has not yet taken place, so no one has won it yet.'

To help the model answer the question, we provide extra contextual information in the prompt. When the total required context is short, we can include it in the prompt directly. For example, we can use this information taken from Wikipedia. We also update the initial prompt to tell the model explicitly to make use of the provided text.

In [5]:
prompt = """Answer the question as truthfully as possible using the provided text, and if the answer is not contained within the text below, say "I don't know"

Context:
The 2022 FIFA World Cup was an international football tournament contested by the men's national teams of FIFA's member associations and 22nd edition of the FIFA World Cup. It took place in Qatar from 20 November to 18 December 2022, making it the first World Cup held in the Arab world and Muslim world, and the second held entirely in Asia after the 2002 tournament in South Korea and Japan.

This tournament was the last with 32 participating teams, with the number of teams being increased to 48 for the 2026 edition. To avoid the extremes of Qatar's hot climate,the event was held during November and December. It was held over a reduced time frame of 29 days with 64 matches played in eight venues across five cities. Qatar entered the event—their first World Cup—automatically as the host's national team, alongside 31 teams determined by the qualification process.

Argentina were crowned the champions after winning the final against the title holder France 4–2 on penalties following a 3–3 draw after extra time. It was Argentina's third title and their first since 1986, as well being the first nation from outside of Europe to win the tournament since 2002. French player Kylian Mbappé became the first player to score a hat-trick in a World Cup final since Geoff Hurst in the 1966 final and won the Golden Boot as he scored the most goals (eight) during the tournament. Argentine captain Lionel Messi was voted the tournament's best player, winning the Golden Ball. Teammates Emiliano Martínez and Enzo Fernández won the Golden Glove, awarded to the tournament's best goalkeeper, and the Young Player Award, awarded to the tournament's best young player, respectively. With 172 goals, the tournament set a new record for the highest number of goals scored with the 32-team format, with every participating team scoring at least one goal.

The choice to host the World Cup in Qatar attracted significant criticism, with concerns raised over the country's treatment of migrant workers, women and members of the LGBT community, as well as Qatar's climate, lack of a strong football culture, scheduling changes, and allegations of bribery for hosting rights and wider FIFA corruption.

Q: Who won the 2022 Fifa World Cup?
A:"""

openai.Completion.create(
    model=COMPLETIONS_MODEL,
    prompt=prompt,
    temperature=0,
    max_tokens=300,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
)["choices"][0]["text"].strip(" \n")

'Argentina won the 2022 FIFA World Cup.'

Adding extra information into the prompt only works when the dataset of extra content that the model may need to know is small enough to fit in a single prompt. What do we do when we need the model to choose relevant contextual information from within a large body of information?

**In the remainder of this notebook, we will demonstrate a method for augmenting GPT-3 with a large body of additional contextual information by using document embeddings and retrieval.** This method answers queries in two steps: first it retrieves the information relevant to the query, then it writes an answer tailored to the question based on the retrieved information. The first step uses the [Embedding API](https://beta.openai.com/docs/guides/embeddings); the second step uses the [Completions API](https://beta.openai.com/docs/guides/completion/introduction).
 
The steps are:
* Preprocess the contextual information by splitting it into chunks and creating an embedding vector for each chunk.
* On receiving a query, embed the query in the same vector space as the context chunks and find the context embeddings which are most similar to the query.
* Prepend the text of the most relevant context chunks to the query prompt.
* Submit the question along with the most relevant context to GPT, and receive an answer which makes use of the provided contextual information.
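
The four steps above can be sketched end to end without calling any API. This is only an illustration: `toy_embed` is a hypothetical stand-in that counts words instead of calling the Embedding API, and `build_prompt` and the three sample chunks are made up for the example. The real notebook uses the functions defined in the following sections.

```python
import math
import re
from collections import Counter

def toy_embed(text):
    """Hypothetical stand-in for an embedding API: a bag-of-words count vector."""
    return Counter(re.findall(r"\w+", text.lower()))

def similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Step 1: split the library into chunks and embed each one (done once, offline).
chunks = [
    "Argentina won the final against France on penalties.",
    "The tournament took place in Qatar in November and December.",
    "Kylian Mbappé won the Golden Boot with eight goals.",
]
chunk_vectors = [toy_embed(c) for c in chunks]

def build_prompt(question, top_k=1):
    # Step 2: embed the query and rank the chunks by similarity to it.
    q = toy_embed(question)
    ranked = sorted(range(len(chunks)), key=lambda i: similarity(q, chunk_vectors[i]), reverse=True)
    # Step 3: prepend the text of the best-matching chunks to the question.
    context = "\n".join("* " + chunks[i] for i in ranked[:top_k])
    return f"Context:\n{context}\n\nQ: {question}\nA:"

# Step 4 would send this prompt to the completions model.
toy_prompt = build_prompt("Who won the final?")
```

A word-count vector only matches exact words, whereas a learned embedding also matches paraphrases, which is why the notebook uses the Embedding API for steps 1 and 2.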

# Preprocess the document library / Create Documents Embeddings

We plan to use document embeddings to fetch the most relevant part or parts of our document library and insert them into the prompt that we provide to GPT-3. We therefore need to break up the document library into "sections" of context, which can be searched and retrieved separately.

Sections should be large enough to contain enough information to answer a question, but small enough that one or several fit into the GPT-3 prompt. We find that approximately a paragraph of text is usually a good length, but you should experiment for your particular use case. In this example, Wikipedia articles are already grouped under semantically related headers, so we will use these to define our sections.
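
If your own documents do not come with headers, one possible starting point is to split on blank lines and bound the section length in words. The sketch below is an assumption about how that could look, not the preprocessing used to build the hosted dataset; `split_into_sections` and its `min_words` and `max_words` parameters are hypothetical names.

```python
def split_into_sections(text, min_words=10, max_words=200):
    """Split text into paragraph-sized sections with a word count for each.

    Paragraphs are separated by blank lines. Long paragraphs are cut into
    consecutive pieces of at most max_words words. Pieces shorter than
    min_words are dropped, so a short tail of a long paragraph is lost.
    """
    sections = []
    for paragraph in text.split("\n\n"):
        words = paragraph.split()
        for start in range(0, len(words), max_words):
            piece = words[start:start + max_words]
            if len(piece) >= min_words:
                sections.append({"text": " ".join(piece), "words": len(piece)})
    return sections

# A two-word heading, a 25-word paragraph and a 12-word paragraph.
sample = "Short heading\n\n" + " ".join(["word"] * 25) + "\n\n" + " ".join(["token"] * 12)
sample_sections = split_into_sections(sample, min_words=10, max_words=20)
```

The `min_words` threshold plays the same role as the `words >= 10` filter applied to the dataset below: very short fragments rarely contain enough information to answer a question.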

In [6]:
# We have hosted the processed dataset, so you can download it directly without having to recreate it.
# This dataset has already been split into sections, one row for each section of the Wikipedia page.

df = pd.read_csv('data/fifa_data.csv', index_col=0)
df = df.loc[df['words'] >= 10]
print(f"{len(df)} rows in the data.")
df.sample(5)

530 rows in the data.


| Unnamed: 0 | source | text | words |
|---|---|---|---|
| 554 | wiki | The 2022 FIFA World Cup opening ceremony took ... | 54 |
| 439 | wiki | Paths A and B were affected by the 2022 Russia... | 206 |
| 563 | wiki | The draw for the second round was held on 21 J... | 22 |
| 22 | wiki | Group F's first match was a goalless draw betw... | 211 |
| 252 | wiki | One of the most discussed issues of the Qatar ... | 611 |


We preprocess the document sections by creating an embedding vector for each section. An embedding is a vector of numbers that helps us understand how semantically similar or different two texts are. The closer two embeddings are to each other, the more similar their contents are. See the [documentation on OpenAI embeddings](https://beta.api.openai.org/docs/guides/embeddings/) for more information.

This indexing stage can be executed offline and only runs once to precompute the indexes for the dataset so that each piece of content can be retrieved later. Since this is a small example, we will store and search the embeddings locally. If you have a larger dataset, consider using a vector search engine like [Pinecone](https://www.pinecone.io/) or [Weaviate](https://github.com/semi-technologies/weaviate) to power the search.
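
Because the indexing stage only needs to run once, it can be convenient to cache its result on disk and reload it in later sessions. The following is a minimal sketch using `pickle`; `save_embeddings`, `load_embeddings`, and the file name are hypothetical and not part of this notebook, and the toy vectors stand in for real API output.

```python
import os
import pickle
import tempfile

def save_embeddings(embeddings, path):
    """Write a {row index: embedding vector} dictionary to disk (hypothetical helper)."""
    with open(path, "wb") as f:
        pickle.dump(embeddings, f)

def load_embeddings(path):
    """Read back a dictionary written by save_embeddings (hypothetical helper)."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Toy vectors standing in for real embeddings.
toy_embeddings = {0: [0.1, 0.2, 0.3], 1: [0.4, 0.5, 0.6]}
cache_path = os.path.join(tempfile.mkdtemp(), "document_embeddings.pkl")
save_embeddings(toy_embeddings, cache_path)
restored = load_embeddings(cache_path)
```

Only unpickle files you created yourself or otherwise trust, since loading a pickle can execute arbitrary code. A plain format such as CSV or JSON is a safer choice for embeddings that are shared with others.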

For the purposes of this tutorial we chose to use Ada embeddings V2 (`text-embedding-ada-002`), which offer a very good balance of price and performance. Earlier embedding models came in separate "search" variants for documents and queries; the V2 model uses a single model for both, which is why both constants below name the same model (see the [documentation](https://openai.com/blog/new-and-improved-embedding-model/)).

In [7]:
MODEL_NAME = "ada"

DOC_EMBEDDINGS_MODEL = f"text-embedding-{MODEL_NAME}-002"
QUERY_EMBEDDINGS_MODEL = f"text-embedding-{MODEL_NAME}-002"


In [8]:
def get_doc_embedding(text: str): # -> list[float]:
    result = openai.Embedding.create(
      engine=DOC_EMBEDDINGS_MODEL,
      input=text
    )
    return result["data"][0]["embedding"]

def get_query_embedding(text: str): # -> list[float]:
    result = openai.Embedding.create(
      engine=QUERY_EMBEDDINGS_MODEL,
      input=text
    )
    return result["data"][0]["embedding"]

def compute_doc_embeddings(df): # -> dict[int, list[float]]:
    """
    Create an embedding for each row in the dataframe using the OpenAI Embeddings API.
    
    Return a dictionary that maps the index of each row to its embedding vector.
    """
    return {
        idx: get_doc_embedding(str(r.text).replace("\n", " ")) for idx, r in df.iterrows()
    }

We now compute the embeddings for every section. This makes one API call per row, so it takes a minute or two for this dataset.

In [9]:
document_embeddings = compute_doc_embeddings(df)  # takes about 1 min 44 s for 530 rows

In [10]:
# An example embedding: the first (row index, vector) pair in the dictionary.
example_entry = list(document_embeddings.items())[0]
print(f"{example_entry[0]} : {example_entry[1][:5]}... ({len(example_entry[1])} entries)")

print("Vector size:", len(example_entry[1]))

0 : [-0.0033039876725524664, -0.015193239785730839, 0.03140701726078987, -0.010441366583108902, -0.02467147447168827]... (1536 entries)
Vector size: 1536


So we have split our document library into sections, and encoded them by creating embedding vectors that represent each chunk. Next we will use these embeddings to answer our users' questions.

# Find the most similar document embeddings to the question embedding

At question-answering time, we compute the embedding of the user's question and use it to find the most similar document sections. As before, the search runs locally over the precomputed embeddings; for a larger dataset, a vector search engine such as [Pinecone](https://www.pinecone.io/) or [Weaviate](https://github.com/semi-technologies/weaviate) can take over this step.
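
The choice between dot product and cosine similarity matters little here for a simple reason: OpenAI's documentation states that its embeddings are normalized to length 1, and for unit-length vectors the two measures give the same number. The toy example below illustrates this in plain Python; the helper names and the three-dimensional vectors are made up for the illustration, and the notebook itself uses `np.dot`.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def cosine(x, y):
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))

def normalize(x):
    """Scale a vector to unit length."""
    length = math.sqrt(dot(x, x))
    return [a / length for a in x]

def rank_by_similarity(query_vec, doc_vecs):
    """Return (score, key) pairs sorted from most to least similar."""
    return sorted(((dot(query_vec, vec), key) for key, vec in doc_vecs.items()), reverse=True)

# Toy unit vectors standing in for real 1536-dimensional embeddings.
toy_docs = {"a": normalize([1, 0, 0]), "b": normalize([1, 1, 0]), "c": normalize([0, 0, 1])}
toy_query = normalize([1, 0.5, 0])
toy_ranking = rank_by_similarity(toy_query, toy_docs)
```

If you use embeddings that are not normalized, the two measures can rank documents differently, and cosine similarity is usually the safer default.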

In [11]:
#def vector_similarity(x: list[float], y: list[float]): # -> float:
def vector_similarity(x, y): # -> float:
    """
    We could use cosine similarity or dot product to calculate the similarity between vectors.
    In practice, we have found it makes little difference. 
    """
    return np.dot(np.array(x), np.array(y))

#def order_document_sections_by_query_similarity(query: str, contexts: dict[(str, str), np.array]): # -> list[(float, (str, str))]:
def order_document_sections_by_query_similarity(query, contexts): # -> list[(float, (str, str))]:
    """
    Find the query embedding for the supplied query, and compare it against all of the pre-calculated document embeddings
    to find the most relevant sections. 
    
    Return the list of document sections, sorted by relevance in descending order.
    """
    query_embedding = get_query_embedding(query)
    
    document_similarities = sorted([
        (vector_similarity(query_embedding, doc_embedding), doc_index) for doc_index, doc_embedding in contexts.items()
    ], reverse=True)
    
    return document_similarities

In [12]:
most_relevant_document_sections = order_document_sections_by_query_similarity("Who won Fifa World Cup?", document_embeddings)[:5]

In [13]:
for _, section_index in most_relevant_document_sections:
    # Print the text of each of the most relevant sections.
    document_section = df.loc[section_index]
    print(document_section.text)

The 2022 FIFA World Cup final was the final match of the 2022 FIFA World Cup, the 22nd edition of FIFA's competition for men's national football teams. The match was played at Lusail Stadium in Lusail, Qatar, on 18 December 2022, the Qatari National Day, and was contested by Argentina and defending champions France. The tournament comprised hosts Qatar and 31 other teams who emerged from the qualification phase, organised by the six FIFA confederations. The 32 teams competed in a group stage, from which 16 teams qualified for the knockout stage. En route to the final, Argentina finished first in Group C, with two wins and one loss, before defeating Australia in the round of 16, the Netherlands in the quarter-final through a penalty shoot-out and Croatia in the semi-final. France finished top of Group D with two wins and one loss, defeating Poland in the round of 16, England in the quarter-final and Morocco in the semi-final. The final took place in front of 88,966 spectators, with more

We can see that the most relevant document section for this question is the summary of the 2022 FIFA World Cup final, which is exactly what we would expect.

# Add the most relevant document sections to the query prompt

Once we've calculated the most relevant pieces of context, we construct a prompt by simply prepending them to the supplied query. It is helpful to use a query separator to help the model distinguish between separate pieces of text.

In [14]:
MAX_SECTION_LEN = 1000
SEPARATOR = "\n* "

'Context separator contains 3 tokens'

In [15]:
#def construct_prompt(question: str, context_embeddings: dict, df: pd.DataFrame) -> str:
def construct_prompt(question, context_embeddings, df) -> str:
    """
    Fetch the document sections most relevant to the question and build a prompt from them.
    """
    most_relevant_document_sections = order_document_sections_by_query_similarity(question, context_embeddings)[:5]
    
    chosen_sections = []
    chosen_sections_len = 0
    chosen_sections_indexes = []
     
    for _, section_index in most_relevant_document_sections:
        # Add the text of each top-ranked section (no length check is applied here).
        document_section = df.loc[section_index]
            
        chosen_sections.append(SEPARATOR + document_section.text.replace("\n", " "))
        chosen_sections_indexes.append(str(section_index))
    
    header = """Answer the question as truthfully as possible using the provided context, and if the answer is not contained within the text below, say "I don't know."\n\nContext:\n"""
    
    return header + "".join(chosen_sections) + "\n\n Q: " + question + "\n A:"
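
Note that `construct_prompt` always takes the top five sections and never consults `MAX_SECTION_LEN`. If your sections are long, you may want to stop adding context once a length budget is used up. The sketch below shows one way to do that; `select_sections_within_budget` and its `count_tokens` parameter are hypothetical, and the default whitespace word count is only a rough approximation of a real tokenizer.

```python
def select_sections_within_budget(ranked_sections, max_len, separator="\n* ", count_tokens=None):
    """Keep the highest-ranked sections whose combined length stays within max_len.

    ranked_sections: section texts, most relevant first.
    count_tokens: callable returning the length of a string in tokens. The default
                  counts whitespace-separated words, which is an assumption made
                  to keep this sketch self-contained.
    """
    if count_tokens is None:
        count_tokens = lambda s: len(s.split())
    separator_len = count_tokens(separator)

    chosen, used = [], 0
    for text in ranked_sections:
        cost = count_tokens(text) + separator_len
        if used + cost > max_len:
            break  # stop at the first section that does not fit
        chosen.append(separator + text.replace("\n", " "))
        used += cost
    return chosen

toy_sections = ["one two three", "four five six seven", "eight nine"]
toy_chosen = select_sections_within_budget(toy_sections, max_len=9)
```

To use this with the notebook, you could pass the texts of the ranked sections and `MAX_SECTION_LEN`, and supply a `count_tokens` function based on the tokenizer that matches your completions model.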

In [16]:
prompt = construct_prompt(
    "Who won the 2022 Fifa World Cup?",
    document_embeddings,
    df
)

print("===\n", prompt)

===
 Answer the question as truthfully as possible using the provided context, and if the answer is not contained within the text below, say "I don't know."

Context:

* The 2022 FIFA World Cup final was the final match of the 2022 FIFA World Cup, the 22nd edition of FIFA's competition for men's national football teams. The match was played at Lusail Stadium in Lusail, Qatar, on 18 December 2022, the Qatari National Day, and was contested by Argentina and defending champions France. The tournament comprised hosts Qatar and 31 other teams who emerged from the qualification phase, organised by the six FIFA confederations. The 32 teams competed in a group stage, from which 16 teams qualified for the knockout stage. En route to the final, Argentina finished first in Group C, with two wins and one loss, before defeating Australia in the round of 16, the Netherlands in the quarter-final through a penalty shoot-out and Croatia in the semi-final. France finished top of Group D with two wins an

We have now obtained the document sections that are most relevant to the question. As a final step, let's put it all together to get an answer to the question.

# Answer the user's question based on the context.

Now that we've retrieved the relevant context and constructed our prompt, we can finally use the Completions API to answer the user's query.

In [17]:
COMPLETIONS_API_PARAMS = {
    # We use temperature of 0.0 because it gives the most predictable, factual answer.
    "temperature": 0.0,
    "max_tokens": 300,
    "model": COMPLETIONS_MODEL,
}

In [18]:
#def answer_query_with_context(
#    query: str,
#    df: pd.DataFrame,
#    document_embeddings: dict[(str, str), np.array],
#    show_prompt: bool = False
#) -> str:
def answer_query_with_context(
    query,
    df,
    document_embeddings,
    show_prompt = False
):
    prompt = construct_prompt(
        query,
        document_embeddings,
        df
    )
    
    if show_prompt:
        print(prompt)

    response = openai.Completion.create(
                prompt=prompt,
                **COMPLETIONS_API_PARAMS
            )

    return response["choices"][0]["text"].strip(" \n")

In [19]:
answer_query_with_context("Who won the 2022 Fifa World Cup?", df, document_embeddings)

'Argentina won the 2022 FIFA World Cup.'

Wow! By combining the Embeddings and Completions APIs, we have created a question-answering system that can answer questions using a large base of additional knowledge. The prompt also instructs it to say "I don't know" when the answer is not in the provided context.

For this example we have used a dataset of Wikipedia articles, but that dataset could be replaced with books, articles, documentation, service manuals, or much much more. **We can't wait to see what you create with GPT-3!**

# More Examples

Let's have some fun and try some more examples.

In [20]:
query = "How did the 2022 Fifa World Cup Final Game go?"
answer = answer_query_with_context(query, df, document_embeddings)

print(f"\nQ: {query}\nA: {answer}")


Q: How did the 2022 Fifa World Cup Final Game go?
A: Argentina won the 2022 FIFA World Cup Final 4–2 on penalties after a 3–3 draw with France after extra time.
