[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/docs/gen-qa-openai.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/docs/gen-qa-openai.ipynb)

# Retrieval Enhanced Generative Question Answering with OpenAI

#### Fixing LLMs that Hallucinate

In this notebook we will learn how to retrieve contexts relevant to our queries from Pinecone and pass them to a generative OpenAI model, so that the generated answer is backed by real data sources. Required installs for this notebook are:

In [1]:
!pip install -qU \
    openai==1.66.3 \
    pinecone==6.0.2 \
    pinecone-datasets==1.0.2 \
    pinecone-notebooks==0.1.1 \
    tqdm

---

## Building a Knowledge Base

Building more reliable LLM tools requires an external _"Knowledge Base"_, a place where we can store information and retrieve it efficiently. We can think of this as the external _long-term memory_ of our LLM.

We will need to retrieve information that is semantically related to our queries. To do this we use _"dense vector embeddings"_, which can be thought of as numerical representations of the *meaning* behind our sentences.
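
As a rough illustration of what "semantically related" means for vectors, the sketch below compares toy 3-dimensional vectors using cosine similarity. The vectors and the `cosine_similarity` helper are made up for this example; real embeddings have far more dimensions and come from a model, not from hand-picked numbers.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (hypothetical values, chosen by hand for illustration)
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
car = [0.0, 0.2, 0.9]

sim_related = cosine_similarity(cat, kitten)
sim_unrelated = cosine_similarity(cat, car)

# Vectors pointing in similar directions score closer to 1
print(sim_related > sim_unrelated)  # True
```

Sentences with similar meaning should end up with vectors that point in similar directions, which is what makes retrieval by similarity possible.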

There are many options for creating these dense vectors, like open-source [sentence transformers](https://pinecone.io/learn/nlp/) or OpenAI's [ada-002 model](https://youtu.be/ocxq84ocYi0) (`text-embedding-ada-002`). We will use OpenAI's offering in this example.

### Demo Data: YouTube Transcripts

We have already precomputed the embeddings to speed things up. If you'd like to work through the full process, however, check out [this notebook](https://github.com/pinecone-io/examples/blob/master/learn/generation/openai/gen-qa-openai.ipynb).

To download our precomputed embeddings we use Pinecone datasets:

In [2]:
from pinecone_datasets import load_dataset

dataset = load_dataset('youtube-transcripts-text-embedding-ada-002')

# We drop empty 'metadata' column
dataset.documents.drop(['metadata'], axis=1, inplace=True)
# Rename the 'blob' column to 'metadata'
dataset.documents.rename(columns={'blob': 'metadata'}, inplace=True)

# View a few records
dataset.head()


Loading documents parquet files: 100%|██████████| 1/1 [00:34<00:00, 34.11s/it]


| | id | values | sparse_values | metadata |
|---|---|---|---|---|
| 0 | 35Pdoyi6ZoQ-t0.0 | [-0.010402066633105278, -0.018359748646616936,... | | {'channel_id': 'UCv83tO5cePwHMt1952IVVHw', 'en... |
| 1 | 35Pdoyi6ZoQ-t18.48 | [-0.011849376372992992, 0.0007984379190020263,... | | {'channel_id': 'UCv83tO5cePwHMt1952IVVHw', 'en... |
| 2 | 35Pdoyi6ZoQ-t32.36 | [-0.014534404501318932, -0.0003158661129418760... | | {'channel_id': 'UCv83tO5cePwHMt1952IVVHw', 'en... |
| 3 | 35Pdoyi6ZoQ-t51.519999999999996 | [-0.011597747914493084, -0.007550035137683153,... | | {'channel_id': 'UCv83tO5cePwHMt1952IVVHw', 'en... |
| 4 | 35Pdoyi6ZoQ-t67.28 | [-0.015879768878221512, 0.0030445053707808256,... | | {'channel_id': 'UCv83tO5cePwHMt1952IVVHw', 'en... |


Let's take a closer look at one of these rows:

In [3]:
row1 = dataset.documents.iloc[0:1].to_dict(orient="records")[0]
row1

{'id': '35Pdoyi6ZoQ-t0.0',
 'values': array([-0.01040207, -0.01835975, -0.00418702, ...,  0.00098548,
        -0.03338869,  0.00290606], shape=(1536,)),
 'sparse_values': None,
 'metadata': {'channel_id': 'UCv83tO5cePwHMt1952IVVHw',
  'end': 74,
  'published': '2021-07-06 13:00:03 UTC',
  'start': 0,
  'text': "Hi, welcome to the video. So this is the fourth video in a Transformers from Scratch mini series. So if you haven't been following along, we've essentially covered what you can see on the screen. So we got some data. We built a tokenizer with it. And then we've set up our input pipeline ready to begin actually training our model, which is what we're going to cover in this video. So let's move over to the code. And we see here that we have essentially everything we've done so far. So we've built our input data, our input pipeline. And we're now at a point where we have a data loader, PyTorch data loader, ready. And we can begin training a model with it. So there are a few things 

In [4]:
dimension = len(row1['values'])
print(f"The embeddings in this dataset have dimension {dimension}")

The embeddings in this dataset have dimension 1536
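
The index we create next takes a single `dimension` value, so it can be worth confirming that every record agrees before upserting. The sketch below is one way to do that on plain dictionaries. The `embedding_dimension` helper and the toy records are assumptions for illustration and are not part of `pinecone_datasets`.

```python
def embedding_dimension(records):
    """Return the embedding length shared by all records, or raise if they disagree."""
    dimensions = {len(record["values"]) for record in records}
    if len(dimensions) != 1:
        raise ValueError(
            f"Expected exactly one embedding dimension, found {sorted(dimensions)}"
        )
    return dimensions.pop()

# Hypothetical records shaped like the rows above (id + values)
toy_records = [
    {"id": "a", "values": [0.1, 0.2, 0.3]},
    {"id": "b", "values": [0.4, 0.5, 0.6]},
]

print(embedding_dimension(toy_records))  # 3
```

With the real dataset, the same idea could be applied to `dataset.documents.to_dict(orient="records")`, at the cost of one pass over all rows.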


Now we need a place to store these embeddings and enable an efficient _vector search_ through them all. To do that we use Pinecone.
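
For intuition about what a vector search computes, here is a minimal brute-force sketch: score every stored vector against the query and keep the best `k`. It only illustrates the idea on toy data and should not be read as how Pinecone implements search. The `top_k` helper and the 2-dimensional records are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vector, records, k=2):
    """Exhaustively score all records and return the k most similar as (id, score)."""
    scored = [
        (record["id"], cosine_similarity(query_vector, record["values"]))
        for record in records
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical stored vectors
toy_records = [
    {"id": "t0", "values": [1.0, 0.0]},
    {"id": "t1", "values": [0.7, 0.7]},
    {"id": "t2", "values": [0.0, 1.0]},
]

results = top_k([0.9, 0.1], toy_records, k=2)
print([record_id for record_id, _ in results])  # ['t0', 't1']
```

A linear scan like this touches every vector on every query, which is the cost a dedicated vector index is meant to avoid as the collection grows.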

## Creating an Index

Now that the data is ready, we can set up our index to store it.

We begin by instantiating a Pinecone client. To do this we need a [free API key](https://app.pinecone.io).

In [5]:
import os

if not os.environ.get("PINECONE_API_KEY"):
    from pinecone_notebooks.colab import Authenticate
    Authenticate()

In [6]:
from pinecone import Pinecone

api_key = os.environ.get("PINECONE_API_KEY")

# Configure client
pc = Pinecone(api_key=api_key)

In [7]:
from pinecone import ServerlessSpec

index_name = 'gen-qa-openai-fast'

# Check if the index already exists (it shouldn't if this is the first time running this demo)
if not pc.has_index(name=index_name):
    # If it does not exist, create the index
    pc.create_index(
        name=index_name,
        dimension=dimension, # dimensionality of text-embedding-ada-002
        metric='cosine',
        spec=ServerlessSpec(
            cloud='aws', 
            region='us-east-1'
        )
    )

# Instantiate an index client
index = pc.Index(name=index_name)

# View index stats of our new, empty index
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {},
 'total_vector_count': 0,
 'vector_type': 'dense'}

We can see the index is currently empty with a `total_vector_count` of `0`. 

We can begin populating it with the embeddings built with OpenAI's `text-embedding-ada-002`, like so:

In [15]:
from tqdm import tqdm

batch_size = 100

for start in tqdm(range(0, len(dataset.documents), batch_size), "Upserting records batch"):
    batch = dataset.documents.iloc[start:start + batch_size].to_dict(orient="records")
    index.upsert(vectors=batch)


Upserting records batch: 100%|██████████| 390/390 [05:45<00:00,  1.13it/s]


Now we've added all of our YouTube transcripts and their embeddings to the index. With that we can move on to retrieval and then answer generation.
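
The batching pattern used in the upsert loop can be isolated and checked without touching the index. The sketch below uses a hypothetical `iter_batches` helper and made-up records to show how a list is split into slices of at most `batch_size` items.

```python
import math

def iter_batches(records, batch_size):
    """Yield consecutive slices of `records` with at most `batch_size` items each."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]

# Hypothetical records; only the count matters here
toy_records = [{"id": str(i)} for i in range(250)]

batches = list(iter_batches(toy_records, batch_size=100))
print([len(batch) for batch in batches])    # [100, 100, 50]
print(math.ceil(len(toy_records) / 100))    # 3
```

By the same arithmetic, the 390 batches of 100 reported by the progress bar above imply between 38,901 and 39,000 records in the dataset.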

## Retrieval with Pinecone

To search through our documents we first need to create a query vector `xq`. Then, using `xq` we will retrieve the most relevant chunks from our index. 

To create that query vector we will again use OpenAI's `text-embedding-ada-002` model. For this, you need an [OpenAI API key](https://platform.openai.com/).

In [16]:
def create_embedding(query):
    from openai import OpenAI

    # Get OpenAI api key from platform.openai.com
    openai_api_key = os.getenv('OPENAI_API_KEY') or 'sk-...'

    # Instantiate the OpenAI client
    client = OpenAI(api_key=openai_api_key)

    # Create an embedding
    res = client.embeddings.create(
      model="text-embedding-ada-002",
      input=[query],
    )
    return res.data[0].embedding
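
Note that `create_embedding` falls back to the placeholder string `'sk-...'` when `OPENAI_API_KEY` is unset, so a missing key only shows up as an authentication error on the first request. If you prefer to fail earlier with a clearer message, a small helper along these lines could be used instead. `require_env` and the demo variable names are hypothetical, not part of any SDK.

```python
import os

def require_env(name):
    """Return the value of an environment variable, or raise a descriptive error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; export it before running this cell")
    return value

# Hypothetical variable, set here only so the demo runs end to end
os.environ["GEN_QA_DEMO_KEY"] = "dummy-value"
print(require_env("GEN_QA_DEMO_KEY"))  # dummy-value
```

In the notebook this would read `openai_api_key = require_env('OPENAI_API_KEY')`.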

In [18]:
query = (
    "Which training method should I use for sentence transformers when " +
    "I only have pairs of related sentences?"
)

xq = create_embedding(query)

# Retrieve the most relevant contexts from Pinecone, along with their metadata
query_results = index.query(vector=xq, top_k=2, include_metadata=True)
query_results

{'matches': [{'id': 'pNvujJ1XyeQ-t418.88',
              'metadata': {'channel_id': 'UCv83tO5cePwHMt1952IVVHw',
                           'end': 568.0,
                           'published': '2021-11-24 16:24:24 UTC',
                           'start': 418.0,
                           'text': 'pairs of related sentences you can go '
                                   'ahead and actually try training or '
                                   'fine-tuning using NLI with multiple '
                                   "negative ranking loss. If you don't have "
                                   'that fine. Another option is that you have '
                                   'a semantic textual similarity data set or '
                                   'STS and what this is is you have so you '
                                   'have sentence A here, sentence B here and '
                                   'then you have a score from from 0 to 1 '
                                   'tha

## Building a chat completion prompt with relevant context

Next, we write some functions to retrieve these relevant contexts from Pinecone and incorporate them into a richer chat completion prompt.

In [19]:
def retrieval_augmented_prompt(query):
    context_limit = 3750
    xq = create_embedding(query)

    # Get relevant contexts
    query_results = index.query(vector=xq, top_k=3, include_metadata=True)
    contexts = [
        x.metadata['text'] for x in query_results.matches
    ]

    # Build our prompt with the retrieved contexts included
    prompt_start = (
        "Answer the question based on the context below.\n\n"+
        "Context:\n"
    )
    prompt_end = (
        f"\n\nQuestion: {query}\nAnswer:"
    )
    context_separator = "\n\n---\n\n"

    # Join contexts and trim to fit within limit
    combined_contexts = []
    total_length = 0
    
    for context in contexts:
        new_length = total_length + len(context) + len(context_separator)
        if new_length >= context_limit:
            break
        combined_contexts.append(context)
        total_length = new_length
    
    return prompt_start + context_separator.join(combined_contexts) + prompt_end

In [20]:
prompt_with_context = retrieval_augmented_prompt(query)
print(prompt_with_context)

Answer the question based on the context below.

Context:
pairs of related sentences you can go ahead and actually try training or fine-tuning using NLI with multiple negative ranking loss. If you don't have that fine. Another option is that you have a semantic textual similarity data set or STS and what this is is you have so you have sentence A here, sentence B here and then you have a score from from 0 to 1 that tells you the similarity between those two scores and you would train this using something like cosine similarity loss. Now if that's not an option and your focus or use case is on building a sentence transformer for another language where there is no current sentence transformer you can use multilingual parallel data. So what I mean by that is so parallel data just means translation pairs so if you have for example a English sentence and then you have another language here so it can it can be anything I'm just going to put XX and that XX is your target language you can fine

## Generating knowledgeable answers with RAG

Now that we are building a rich prompt with context from our index, we are ready to get chat completions from OpenAI.

In [21]:
def chat_completion(prompt):
    from openai import OpenAI

    # Get OpenAI api key from platform.openai.com
    openai_api_key = os.getenv('OPENAI_API_KEY') or 'sk-...'

    # Instantiate the OpenAI client
    client = OpenAI(api_key=openai_api_key)
    
    # Instructions
    sys_prompt = "You are a helpful assistant that always answers questions."
    res = client.chat.completions.create(
        model='gpt-4o-mini-2024-07-18',
        messages=[
            {"role": "system", "content": sys_prompt},
            {"role": "user", "content": prompt}
        ],
        temperature=0
    )
    return res.choices[0].message.content.strip()

In [22]:
def rag(query):
    prompt = retrieval_augmented_prompt(query)
    return chat_completion(prompt)
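
If you want to check the wiring of the pipeline without spending API calls, one option is to pass the three stages in as functions and substitute stand-ins for them. This is a sketch under that assumption. It factors retrieval and prompt building into separate steps, which differs slightly from the notebook's `retrieval_augmented_prompt`, and `make_rag` and the `fake_*` functions are hypothetical.

```python
def make_rag(retrieve, build_prompt, generate):
    """Compose retrieval, prompt building, and generation into one callable."""
    def rag(query):
        contexts = retrieve(query)
        prompt = build_prompt(query, contexts)
        return generate(prompt)
    return rag

# Stand-ins (hypothetical) that replace the Pinecone and OpenAI calls
def fake_retrieve(query):
    return ["Use multiple negatives ranking loss for sentence pairs."]

def fake_build_prompt(query, contexts):
    return "Context:\n" + "\n".join(contexts) + f"\n\nQuestion: {query}\nAnswer:"

def fake_generate(prompt):
    # Echo the first context line to show that it reached the generator
    return prompt.splitlines()[1]

offline_rag = make_rag(fake_retrieve, fake_build_prompt, fake_generate)
print(offline_rag("Which loss should I use?"))
```

Swapping the stand-ins for functions that wrap `index.query` and `chat_completion` would give the same behaviour as `rag` above.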

In [23]:
query = (
    "Which training method should I use for sentence transformers when " +
    "I only have pairs of related sentences?"
)

# Now we can get completions for a context-infused query
answer = rag(query)
print(answer)

You should use a training method that involves fine-tuning with pairs of related sentences using a Siamese architecture. This approach allows you to optimize the weights within the model to reduce the difference between the vector embeddings of the sentence pairs. You can also consider using a negative ranking loss if available, or alternatively, you can use a semantic textual similarity dataset to train with cosine similarity loss.


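And we get a pretty good answer straight away. It points to _multiple negatives ranking loss_ (which the model phrases as "negative ranking loss") and mentions cosine similarity loss as an alternative when a semantic textual similarity dataset is available.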

## Demo cleanup

Once we're done with the index, we can delete it to save resources:

In [None]:
pc.delete_index(name=index_name)

---