# Retrieval Enhanced Generative Question Answering with OpenAI

Welcome to the era of AI, where technologies like ChatGPT and Bard are shaping the future. But even these technologies are prone to errors when faced with complex problems. We will see how we can enhance LLMs to get better answers in a particular field we are interested in.

We will do this with the help of the vector database Qdrant. The process is broadly divided into three steps:
- 1: First, we will use Qdrant to store relevant data from the field we want to use our LLM for. The data will be converted to vector representations and then stored in Qdrant.
- 2: After storing our data, we can query it to find the information most relevant to a question.
- 3: Then we will pass this extra information to the generative OpenAI model along with our query, so that it can answer much more accurately.
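
The three steps above can be sketched as a small orchestration function. This is only a minimal sketch, not the notebook's implementation: `answer_with_retrieval` and its `embed`, `search`, and `generate` parameters are hypothetical names, and the demo uses toy stand-ins instead of real Qdrant or OpenAI calls.

```python
def answer_with_retrieval(query, embed, search, generate, top_k=2):
    """Hypothetical glue code for the retrieve-then-generate flow."""
    query_vector = embed(query)             # turn the question into a vector
    contexts = search(query_vector, top_k)  # fetch the closest stored texts
    context_block = "\n---\n".join(contexts)
    prompt = (
        "Answer the question based on the context below.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {query}\nAnswer: "
    )
    return generate(prompt)                 # let the model answer with the extra context


# Toy stand-ins (assumptions for illustration) so the sketch runs offline.
stored = {"page numbers": "Let the page numbers be n and n + 1."}


def toy_embed(text):
    return text.lower()  # a real embedder would return a list of floats


def toy_search(vector, k):
    return [text for key, text in stored.items() if key in vector][:k]


def toy_generate(prompt):
    return prompt  # echo the prompt instead of calling an LLM


prompt_seen = answer_with_retrieval(
    "What is the sum of the two page numbers?", toy_embed, toy_search, toy_generate
)
print(prompt_seen)
```

Passing the three stages in as functions keeps the flow easy to test before any external service is wired in.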

Let's start:

## Install Dependencies
Let's get started by installing the packages needed for the notebook to run:

In [1]:
!pip install -qU openai==0.27.8 qdrant-client==1.3.1 datasets==2.13.1 tqdm==4.65.0

## Import libraries

In [2]:
import os
import openai
from datasets import load_dataset
from tqdm.auto import tqdm
from qdrant_client import QdrantClient
from qdrant_client.http import models
from pathlib import Path
from IPython.display import Latex
from time import sleep



## Initialize connection to OpenAI

In [4]:
# get API key from top-right dropdown on OpenAI website
openai.api_key = os.getenv("OPENAI_API_KEY") or "OPENAI_API_KEY"
openai.Engine  # confirms the library is loaded; this alone does not call the API or validate the key

openai.api_resources.engine.Engine

Most everyday questions we pose to the OpenAI generative model will be answered correctly.

In [14]:
query = "what is the unit of measurement of sound?"

# now query gpt-3.5-turbo WITHOUT context
res = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": query}],
    temperature=0,
    max_tokens=400,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
)
res["choices"][0]["message"]["content"]

'The unit of measurement of sound is the decibel (dB).'

But if we pose complex problems, or even simple problems that are uncommon and whose answers the model is unlikely to have seen, it can give wrong answers. First, let us write a simple function for querying.

In [18]:
def complete(prompt):
    # query gpt-3.5-turbo
    res = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # controls the randomness of the output; higher values (e.g., 0.8) make the output more random
        max_tokens=500,  # maximum number of tokens the model can generate in the response
        top_p=1,  # nucleus sampling: only tokens within the top_p probability mass are considered; 1 keeps all tokens
        frequency_penalty=0,  # positive values make the model less likely to repeat the same words
        presence_penalty=0,  # positive values penalize tokens that have already appeared, nudging the model toward new topics
    )
    return res["choices"][0]["message"]["content"]
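
To build some intuition for the `temperature` parameter, here is a toy sketch of how temperature is commonly described as rescaling token scores before sampling. It is an illustration with made-up scores, not OpenAI's actual implementation, and `softmax_with_temperature` is a hypothetical helper. The function needs a strictly positive temperature; the notebook's `temperature=0` roughly corresponds to the limiting case where the most likely token is always chosen.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(score - peak) for score in scaled]
    total = sum(exps)
    return [value / total for value in exps]


toy_logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
cool = softmax_with_temperature(toy_logits, 0.2)  # close to deterministic
warm = softmax_with_temperature(toy_logits, 1.5)  # flatter, more random
print([round(p, 3) for p in cool])
print([round(p, 3) for p in warm])
```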

Let's try asking a mathematics question: not a very complex one, but complex enough to get a wrong answer from the OpenAI model.

In [19]:
query = "The product of two consecutive page numbers is 20022. What is the sum of the two page numbers?"
Latex(complete(query))  # using Latex to properly display mathematics expressions

<IPython.core.display.Latex object>

A little calculation shows that the answer is wrong; the correct answer is 283. We will use RAG (retrieval-augmented generation) to fix this.
Let's see what happens if we provide the solution of a similar problem as a reference for our question.

In [20]:
query = (
    # raw strings keep LaTeX backslashes intact (in a normal string the "\f" in \frac is a form feed)
    r"Let the page numbers be $n$ and $n + 1.$ Then, the problem can be modeled by the equation $n(n+1) = 18360.$ "
    r"We can rewrite the equation as $n^2 + n - 18360=0.$ Now using the quadratic formula, we find that "
    r"$$n = \frac{-1 \pm \sqrt{1 + 4\cdot 18360}}{2}.$$ So, $n = 135.$ Hence, $n + (n + 1) = {271}.$ "
    r"This equation can be factored as well, but that would not save much time. The best way to solve this "
    r"quickly would be to notice that $18360$ falls between $135^2=18225$ and $136^2=18496,$ so since we know "
    r"that $n$ is an integer, we can guess that $n = 135.$ Plugging it back into the equation, we see that it works, "
    r"so $n + (n + 1) = {271}.$ "
    "The product of two consecutive page numbers is 20022. What is the sum of the two page numbers?"
)
Latex(complete(query))

<IPython.core.display.Latex object>

And we get the correct answer, 283.
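
The expected answers can be checked independently of the model with a few lines of arithmetic. The helper below is a small sketch (the name `consecutive_pages` is ours, not part of the notebook). It relies on the fact that $n^2 \le n(n+1) < (n+1)^2$, so $n$ must be the integer square root of the product.

```python
import math


def consecutive_pages(product):
    """Return (n, n + 1) with n * (n + 1) == product, or None if no such pair exists."""
    n = math.isqrt(product)  # largest integer whose square does not exceed product
    return (n, n + 1) if n * (n + 1) == product else None


pages = consecutive_pages(20022)      # the question we asked
reference = consecutive_pages(18360)  # the similar problem used as context
print(pages, sum(pages))
print(reference, sum(reference))
```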

## Initialize Retriever

To store our data in Qdrant, we need to convert it into vector representations that capture its semantic meaning. Later, cosine similarity is used to match the query against the stored vectors and find the best-matching data. There are many options for creating vector embeddings. We will use the OpenAI model `text-embedding-ada-002` (**ada**) to do so.
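
As a rough illustration of what cosine-similarity matching means, the sketch below ranks a few hand-made three-dimensional vectors against a query vector. The vectors and the `top_matches` helper are made up for this example. Real embeddings have 1536 dimensions, and Qdrant performs this search internally far more efficiently.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means they point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_matches(query_vector, stored_vectors, limit=2):
    """Return the `limit` best (score, name) pairs, highest score first."""
    scored = [
        (cosine_similarity(query_vector, vector), name)
        for name, vector in stored_vectors.items()
    ]
    return sorted(scored, reverse=True)[:limit]


# Toy "embeddings" (assumed values, for illustration only)
toy_vectors = {
    "page numbers": [0.9, 0.1, 0.0],
    "band formation": [0.1, 0.9, 0.2],
    "polynomial degree": [0.0, 0.2, 0.9],
}
matches = top_matches([0.8, 0.2, 0.1], toy_vectors)
print(matches)
```

Because the score depends only on direction, scaling a vector does not change its similarity to another vector.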

In [21]:
embed_model = "text-embedding-ada-002"

res = openai.Embedding.create(
    input=[
        "Sample document text goes here",
        "there will be several phrases in each batch",
    ],
    engine=embed_model,
)

The `res` we get is a JSON-like object with the embeddings in the `data` field.

In [22]:
res.keys()

dict_keys(['object', 'data', 'model', 'usage'])

We get one record per input sentence, so two in total. The `text-embedding-ada-002` model's output dimensionality is `1536`.

In [23]:
len(res["data"])

2

In [24]:
len(res["data"][0]["embedding"]), len(res["data"][1]["embedding"])

(1536, 1536)

### Load Dataset

We will use the **qwedsacf/competition_math** dataset from Hugging Face, which consists of problems from mathematics competitions along with their full step-by-step solutions. It can be used to teach models to generate answer derivations and explanations.

In [25]:
df = load_dataset("qwedsacf/competition_math", split="train").to_pandas()
df.head()

Found cached dataset parquet (C:/Users/karti/.cache/huggingface/datasets/qwedsacf___parquet/qwedsacf--competition_math-7113d5674a916e94/0.0.0/14a00e99c0d15a23649d0db8944380ac81082d4b021f398733dd84f3a6c569a7)


| | problem | level | type | solution |
|---|---|---|---|---|
| 0 | Let \[f(x) = \left\{\n\begin{array}{cl} ax+3, ... | Level 5 | Algebra | For the piecewise function to be continuous, t... |
| 1 | A rectangular band formation is a formation wi... | Level 5 | Algebra | Let $x$ be the number of band members in each ... |
| 2 | What is the degree of the polynomial $(4 +5x^3... | Level 3 | Algebra | This polynomial is not written in standard for... |
| 3 | Evaluate $\left\lceil3\left(6-\frac12\right)\r... | Level 3 | Algebra | Firstly, $3\left(6-\frac12\right)=18-1-\frac12... |
| 4 | Sam is hired for a 20-day period. On days that... | Level 3 | Algebra | Call $x$ the number of days Sam works and $y$ ... |


We only need the **problem** and **solution** columns from the dataset. We will create vector embeddings of the problems, and the corresponding solutions will be stored as payload in the Qdrant collection, which will serve as a secondary source of information for our query.

In [26]:
df = df.drop(columns=["level", "type"])  # drop unnecessary columns
df = df.replace("boxed", "", regex=True)  # remove substring boxed from dataframe
df.head()

| | problem | solution |
|---|---|---|
| 0 | Let \[f(x) = \left\{\n\begin{array}{cl} ax+3, ... | For the piecewise function to be continuous, t... |
| 1 | A rectangular band formation is a formation wi... | Let $x$ be the number of band members in each ... |
| 2 | What is the degree of the polynomial $(4 +5x^3... | This polynomial is not written in standard for... |
| 3 | Evaluate $\left\lceil3\left(6-\frac12\right)\r... | Firstly, $3\left(6-\frac12\right)=18-1-\frac12... |
| 4 | Sam is hired for a 20-day period. On days that... | Call $x$ the number of days Sam works and $y$ ... |


## Initialize Qdrant client

In [27]:
# Initialize Qdrant client

current_folder = Path.cwd()  # get the current folder
qdrant_folder = current_folder / "qdrant"
qdrant_folder.mkdir(exist_ok=True)  # create the qdrant folder if it does not exist yet

client = QdrantClient(path=str(qdrant_folder.resolve()))  # local on-disk storage in the qdrant folder

## Create collection

In [28]:
collection_name = "openai-math-problems"

collections = client.get_collections()
print(collections)

# only create the collection if it doesn't exist yet
existing_names = {c.name for c in collections.collections}
if collection_name not in existing_names:
    client.create_collection(
        collection_name=collection_name,
        vectors_config=models.VectorParams(
            size=1536,  # dimensionality of the vectors output by the embedding model
            distance=models.Distance.COSINE,  # metric used to compare vectors
        ),
    )
collections = client.get_collections()
print(collections)

collections=[]
collections=[CollectionDescription(name='openai-math-problems')]


## Generate Embeddings -> Store in Qdrant
Now we will generate embeddings for our problems. We will do so in batches, which is much faster than embedding them one at a time, and then upsert each batch with a single API call (also much faster).

In Qdrant, each document in the dataset needs an id (a unique value), a vector (the embedding of the problem), and a payload (metadata). The payload is a dictionary containing data relevant to our embeddings.
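
The batching scheme can be sketched on its own, without any API calls. The helper below is a hypothetical illustration (`iter_batches` and the toy records are not part of the notebook). It pairs each slice of records with positional ids, in the same spirit as the cell that follows.

```python
def iter_batches(records, batch_size):
    """Yield (ids, batch) pairs, where ids are the positions of the records."""
    for start in range(0, len(records), batch_size):
        end = min(start + batch_size, len(records))  # the last batch may be shorter
        yield list(range(start, end)), records[start:end]


# Toy payloads standing in for the problem/solution rows
toy_records = [
    {"problem": f"problem {i}", "solution": f"solution {i}"} for i in range(10)
]
batches = list(iter_batches(toy_records, batch_size=4))
for ids, batch in batches:
    print(ids, [record["problem"] for record in batch])

# With 12,500 rows and a batch size of 1024, the loop runs this many times:
num_batches = len(range(0, 12500, 1024))
print(num_batches)
```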

In [29]:
%%time

batch_size = 1024  # specify batch size according to your RAM and compute, higher batch size = more RAM usage

for i in tqdm(range(0, len(df), batch_size)):
    i_end = min(i + batch_size, len(df))  # find end of batch
    batch = df.iloc[i:i_end]  # extract batch
    ids = list(range(i, i_end))  # create unique IDs

    # create embeddings, waiting and retrying whenever we hit a RateLimitError
    while True:
        try:
            res = openai.Embedding.create(
                input=batch["problem"].tolist(), engine=embed_model
            )
            break
        except openai.error.RateLimitError:
            sleep(5)  # give the rate limit time to reset before retrying
    embeds = [record["embedding"] for record in res["data"]]

    meta = batch.to_dict(orient="records")  # get metadata

    # upsert to qdrant
    client.upsert(
        collection_name=collection_name,
        points=models.Batch(ids=ids, vectors=embeds, payloads=meta),
    )

collection_vector_count = client.get_collection(
    collection_name=collection_name
).vectors_count
print(f"Vector count in collection: {collection_vector_count}")
assert collection_vector_count == len(df)

  0%|          | 0/13 [00:00<?, ?it/s]

Vector count in collection: 12500
CPU times: total: 1min 1s
Wall time: 4min 25s
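
The embedding cell above retries after a fixed five-second pause. A common alternative is exponential backoff with a cap on the number of attempts, sketched below. `retry_with_backoff`, `FakeRateLimitError`, and `flaky_call` are hypothetical names used only for this illustration. In the notebook, the retryable exception would be the rate-limit error raised by the OpenAI client.

```python
import time


def retry_with_backoff(call, retry_on, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Run call(); on a retryable error wait base_delay * 2**attempt, then try again."""
    for attempt in range(max_retries):
        try:
            return call()
        except retry_on:
            if attempt == max_retries - 1:
                raise  # give up after the last allowed attempt
            sleep(base_delay * 2 ** attempt)


class FakeRateLimitError(Exception):
    """Stand-in for a real rate-limit exception."""


attempts = {"count": 0}


def flaky_call():
    # fails twice, then succeeds, to imitate a temporary rate limit
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise FakeRateLimitError("slow down")
    return "embeddings"


waits = []  # record the requested delays instead of actually sleeping
result = retry_with_backoff(flaky_call, FakeRateLimitError, sleep=waits.append)
print(result, waits)
```

Capping the attempts means a persistent failure surfaces as an error instead of looping forever.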


Now we can search the collection with query vectors:

In [30]:
query = "The product of two consecutive page numbers is 20022. What is the sum of the two page numbers?"

res = openai.Embedding.create(input=[query], engine=embed_model)
encoded_query = res["data"][0]["embedding"]

res = client.search(
    collection_name=collection_name,
    query_vector=encoded_query,
    limit=2,
)

In [31]:
res

[ScoredPoint(id=71, version=0, score=0.9155836135758795, payload={'problem': 'The product of two consecutive page numbers is $18{,}360.$ What is the sum of the two page numbers?', 'solution': 'Let the page numbers be $n$ and $n + 1.$ Then, the problem can be modeled by the equation $n(n+1) = 18360.$ We can rewrite the equation as $n^2 + n - 18360=0.$\n\nNow using the quadratic formula, we find that $$n = \\frac{-1 \\pm \\sqrt{1 + 4\\cdot 18360}}{2}.$$ So, $n = 135.$ Hence, $n + (n + 1) = \\{271}.$\n\nThis equation can be factored as well, but that would not save much time. The best way to solve this quickly would be to notice that $18360$ falls between $135^2=18225$ and $136^2=18496,$ so since we know that $n$ is an integer, we can guess that $n = 135.$ Plugging it back into the equation, we see that it works, so $n + (n + 1) = \\{271}.$'}, vector=None),
 ScoredPoint(id=10877, version=0, score=0.8920983857402267, payload={'problem': 'The product of two positive whole numbers is 2005. I

Let us write a function **retrieve** that fetches relevant solutions from Qdrant and builds a prompt from them.

In [32]:
limit = 3750  # maximum number of context characters to include in the prompt


def retrieve(query):
    # embed the query with the same model used for the stored problems
    res = openai.Embedding.create(input=[query], engine=embed_model)
    encoded_query = res["data"][0]["embedding"]

    # get the most similar problems and their solutions from Qdrant
    res = client.search(
        collection_name=collection_name,
        query_vector=encoded_query,
        limit=2,
    )
    contexts = [x.payload["problem"] + "\n" + x.payload["solution"] for x in res]

    # build our prompt with the retrieved solutions included
    prompt_start = "Answer the question based on the context below.\n\nContext:\n"
    prompt_end = f"\n\nQuestion: {query}\nAnswer: "

    # keep adding solutions while the joined context stays within the limit
    selected = []
    for context in contexts:
        if len("\n---\n".join(selected + [context])) > limit:
            break
        selected.append(context)
    return prompt_start + "\n---\n".join(selected) + prompt_end

In [33]:
# first we retrieve relevant items from qdrant
query_with_contexts = retrieve(query)

Latex(query_with_contexts)

<IPython.core.display.Latex object>

In [34]:
# then we complete the context-infused query
Latex(complete(query_with_contexts))

<IPython.core.display.Latex object>

In [40]:
query = r"Solve for $x>0$ in the following arithmetic sequence: $1^2, x^2, 3^2, \ldots$."
Latex(complete(query))

<IPython.core.display.Latex object>

Again we get a wrong answer. Now let's try the retrieval technique:

In [41]:
query_with_contexts = retrieve(query)
Latex(complete(query_with_contexts))

<IPython.core.display.Latex object>

We get the correct answer. You can try it on more questions.
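
For reference, the expected answer to this question can be derived by hand. In an arithmetic sequence, consecutive terms differ by the same amount, so the middle term is the average of its neighbours:

```latex
\begin{aligned}
x^2 - 1^2 &= 3^2 - x^2 \\
2x^2 &= 10 \\
x^2 &= 5 \\
x &= \sqrt{5} \qquad (x > 0)
\end{aligned}
```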

In [None]:
client.delete_collection(collection_name=collection_name)

---