# Retrieval-augmented generative question answering with Pinecone (updated for the openai and pinecone client versions current as of 2025-01)

ref: https://cookbook.openai.com/examples/vector_databases/pinecone/gen_qa

In [1]:
!pip install -qU openai pinecone-client datasets

In [2]:
pip install python-dotenv

Note: you may need to restart the kernel to use updated packages.


In [3]:
import openai
from openai import OpenAI
from dotenv import load_dotenv, find_dotenv
import os
_ = load_dotenv(find_dotenv()) # read local .env file
openai.api_key = os.environ.get("OPENAI_API_KEY")
# print(openai.api_key)

In [4]:
client = OpenAI(
    api_key=openai.api_key
)

In [5]:
def get_completion(prompt, model="gpt-3.5-turbo"):
    # send the prompt as a single user message, using the client created above
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        temperature=0,
        max_tokens=400,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
    )
    return response.choices[0].message.content.strip()

In [6]:
query = (
    "Which training method should I use for sentence transformers when " +
    "I only have pairs of related sentences?"
)

get_completion(query)

'If you only have pairs of related sentences, you can use a Siamese network architecture for training sentence transformers. Siamese networks are designed to learn similarity between pairs of inputs, making them well-suited for tasks like sentence similarity or paraphrase detection. By training a Siamese network on your pairs of related sentences, you can learn a representation that captures the semantic similarity between sentences. This representation can then be used for tasks like information retrieval, question answering, or text classification.'

In [7]:
embed_model = "text-embedding-ada-002"

# embed two sample texts in one call, using the client created above
res = client.embeddings.create(
    input=[
        "Sample document text goes here",
        "there will be several phrases in each batch"
    ],
    model=embed_model
)

In [8]:
len(res.data)

2

In [9]:
len(res.data[0].embedding), len(res.data[1].embedding)

(1536, 1536)

In [10]:
!pip install datasets transformers



In [11]:
from datasets import load_dataset

data = load_dataset('jamescalam/youtube-transcriptions', split='train')
data

Dataset({
    features: ['title', 'published', 'url', 'video_id', 'channel_id', 'id', 'text', 'start', 'end'],
    num_rows: 208619
})

In [12]:
data[0]

{'title': 'Training and Testing an Italian BERT - Transformers From Scratch #4',
 'published': '2021-07-06 13:00:03 UTC',
 'url': 'https://youtu.be/35Pdoyi6ZoQ',
 'video_id': '35Pdoyi6ZoQ',
 'channel_id': 'UCv83tO5cePwHMt1952IVVHw',
 'id': '35Pdoyi6ZoQ-t0.0',
 'text': 'Hi, welcome to the video.',
 'start': 0.0,
 'end': 9.36}

In [13]:
from tqdm.auto import tqdm

new_data = []

window = 20  # number of sentences to combine
stride = 4  # number of sentences to 'stride' over, used to create overlap

for i in tqdm(range(0, len(data), stride)):
    i_end = min(len(data)-1, i+window)
    if data[i]['title'] != data[i_end]['title']:
        # in this case we skip this entry as we have start/end of two videos
        continue
    text = ' '.join(data[i:i_end]['text'])
    # create the new merged dataset
    new_data.append({
        'start': data[i]['start'],
        'end': data[i_end]['end'],
        'title': data[i]['title'],
        'text': text,
        'id': data[i]['id'],
        'url': data[i]['url'],
        'published': data[i]['published'],
        'channel_id': data[i]['channel_id']
    })

100%|██████████| 52155/52155 [01:18<00:00, 661.39it/s]
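
The merging loop above combines `window` consecutive sentences and advances by `stride`, so neighbouring chunks overlap. The sketch below shows the same idea on a toy list of strings. The helper name `sliding_windows` is hypothetical, and unlike the notebook cell it does not check video titles or carry metadata.

```python
def sliding_windows(sentences, window=20, stride=4):
    """Merge consecutive sentences into overlapping chunks (illustrative helper)."""
    chunks = []
    for start in range(0, len(sentences), stride):
        # clamp the end of the window to the end of the list
        end = min(len(sentences), start + window)
        chunks.append(" ".join(sentences[start:end]))
    return chunks


# toy data: ten short "sentences"
sentences = [f"s{n}" for n in range(10)]
chunks = sliding_windows(sentences, window=4, stride=2)
for chunk in chunks:
    print(chunk)
```

With `window=4` and `stride=2`, each chunk shares two sentences with the previous one. The notebook's `window=20` and `stride=4` give a much larger overlap, which makes it less likely that a relevant passage is split across chunk boundaries.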


In [14]:
new_data[0], new_data[1]

({'start': 0.0,
  'end': 74.12,
  'title': 'Training and Testing an Italian BERT - Transformers From Scratch #4',
  'text': "Hi, welcome to the video. So this is the fourth video in a Transformers from Scratch mini series. So if you haven't been following along, we've essentially covered what you can see on the screen. So we got some data. We built a tokenizer with it. And then we've set up our input pipeline ready to begin actually training our model, which is what we're going to cover in this video. So let's move over to the code. And we see here that we have essentially everything we've done so far. So we've built our input data, our input pipeline. And we're now at a point where we have a data loader, PyTorch data loader, ready. And we can begin training a model with it. So there are a few things to be aware of. So I mean, first, let's just have a quick look at the structure of our data.",
  'id': '35Pdoyi6ZoQ-t0.0',
  'url': 'https://youtu.be/35Pdoyi6ZoQ',
  'published': '2021-07-

In [15]:
from pinecone import Pinecone, ServerlessSpec

# initialize connection to pinecone (get API key at app.pinecone.io)
api_key = os.getenv('PINECONE_API_KEY')
index_name = 'openai-youtube-transcriptions'

pc = Pinecone(api_key=api_key)

# check if index already exists (it shouldn't if this is first time)
if index_name not in pc.list_indexes().names():
    # if does not exist, create index
    pc.create_index(
        name=index_name,
        dimension=len(res.data[0].embedding),
        metric='cosine',
        spec=ServerlessSpec(
            cloud='aws',  
            region='us-east-1'  
        ),
    )
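
The index is created with `metric='cosine'`. As a rough illustration of what that metric measures (this is not Pinecone's implementation, and the helper name `cosine_similarity` is an assumption), cosine similarity is the dot product of two vectors divided by the product of their lengths:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same = cosine_similarity([1.0, 2.0], [2.0, 4.0])       # same direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])  # unrelated directions
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])   # opposite directions
print(same, orthogonal, opposite)
```

Vectors pointing the same way score close to 1 regardless of their magnitude, which is why the metric suits comparing text embeddings.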



In [16]:
index = pc.Index(index_name)
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 800}},
 'total_vector_count': 800}

In [17]:
# describe_index returns the index configuration and status (not vector counts)
description = pc.describe_index(index_name)
print(description)

{'deletion_protection': 'disabled',
 'dimension': 1536,
 'host': 'openai-youtube-transcriptions-4c82npc.svc.aped-4627-b74a.pinecone.io',
 'metric': 'cosine',
 'name': 'openai-youtube-transcriptions',
 'spec': {'serverless': {'cloud': 'aws', 'region': 'us-east-1'}},
 'status': {'ready': True, 'state': 'Ready'}}


In [20]:
from tqdm.auto import tqdm

batch_size = 80  # how many embeddings we create and insert at once
end_ = 800  # number of chunks to index; use len(new_data) (48000+) to index everything
for i in tqdm(range(0, end_, batch_size)):
    # find end of batch
    i_end = min(end_, i+batch_size)
    meta_batch = new_data[i:i_end]
    # get ids
    ids_batch = [x['id'] for x in meta_batch]
    # get texts to encode
    texts = [x['text'] for x in meta_batch]
    # embed the whole batch in a single API call
    res = client.embeddings.create(input=texts, model=embed_model)
    embeds = [record.embedding for record in res.data]
    # cleanup metadata
    meta_batch = [{
        'start': x['start'],
        'end': x['end'],
        'title': x['title'],
        'text': x['text'],
        'url': x['url'],
        'published': x['published'],
        'channel_id': x['channel_id']
    } for x in meta_batch]
    to_upsert = list(zip(ids_batch, embeds, meta_batch))
    # upsert to Pinecone
    index.upsert(vectors=to_upsert)

100%|██████████| 10/10 [03:21<00:00, 20.11s/it]
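
The upsert loop above makes one embedding request per batch, and such requests can fail transiently, for example when a rate limit is hit. One possible way to make the loop more robust is a small retry wrapper with exponential backoff. The helper `with_retries` below is hypothetical and is demonstrated on a toy function rather than a real API call.

```python
import time


def with_retries(fn, max_attempts=5, base_delay=1.0):
    """Call fn(); on failure wait base_delay * 2**attempt seconds and retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            # give up and re-raise once the last attempt has failed
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)


# toy stand-in for an API call: fails twice, then succeeds
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"


result = with_retries(flaky, base_delay=0.0)
print(result, calls["n"])
```

In the loop above, the embedding call could be passed in as a `lambda`. Catching a narrower exception such as `openai.RateLimitError` instead of `Exception` would avoid retrying on errors that will never succeed.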


In [22]:
# reuse the `query` defined earlier:
# "Which training method should I use for sentence transformers when
#  I only have pairs of related sentences?"
res = client.embeddings.create(
    input=[query],
    model=embed_model
)

# retrieve from Pinecone
xq = res.data[0].embedding

# get the two most similar chunks, together with their metadata
res = index.query(vector=xq, top_k=2, include_metadata=True)

In [23]:
res

{'matches': [{'id': 'x1lAcT3xl5M-t534.0',
              'metadata': {'channel_id': 'UCv83tO5cePwHMt1952IVVHw',
                           'end': 578.0,
                           'published': '2021-05-27 16:15:39 UTC',
                           'start': 534.0,
                           'text': 'that to our actual training data. '
                                   "Otherwise if it's just a single sentence "
                                   "we don't actually add it to the training "
                                   'data. I mean ideally we would want to do '
                                   'something like that but for this use case '
                                   "I don't want to get make things too "
                                   "complicated. So the reason I'm doing that "
                                   'is for example this sentence is just a '
                                   'single sentence in that paragraph and we '
                                   "can

In [30]:
limit = 3750  # maximum number of context characters to include in the prompt

def retrieve(query):
    # embed the query
    res = client.embeddings.create(
        input=[query],
        model=embed_model
    )

    # retrieve from Pinecone
    xq = res.data[0].embedding

    # get relevant contexts
    res = index.query(vector=xq, top_k=3, include_metadata=True)
    contexts = [
        x['metadata']['text'] for x in res['matches']
    ]

    # build our prompt with the retrieved contexts included
    prompt_start = (
        "Answer the question based on the context below.\n\n"+
        "Context:\n"
    )
    prompt_end = (
        f"\n\nQuestion: {query}\nAnswer:"
    )
    # append contexts, in ranked order, until the next one would exceed the limit
    # (this also defines the prompt when zero or one context is returned)
    separator = "\n\n---\n\n"
    selected = []
    for context in contexts:
        if len(separator.join(selected + [context])) > limit:
            break
        selected.append(context)
    prompt = prompt_start + separator.join(selected) + prompt_end
    return prompt

In [32]:
query_with_contexts = retrieve(query)
print(query_with_contexts)

Answer the question based on the context below.

Context:
that to our actual training data. Otherwise if it's just a single sentence we don't actually add it to the training data. I mean ideally we would want to do something like that but for this use case I don't want to get make things too complicated. So the reason I'm doing that is for example this sentence is just a single sentence in that paragraph and we can't guarantee that each continuous paragraph is talking about the same subject you might switch. So for the sake of simplicity I'm just going to ignore the single sentence paragraphs although we do have them in our bag so they can be pulled in as potential sentence

---

sentences and then we say if number of sentences is greater than one oops OK don't excuse it right now and then we apply our 50-50 logic and append that to our actual training data. Otherwise if it's just a single sentence we don't actually add it to the training data. I mean ideally we would want to do someth

In [33]:
get_completion(query_with_contexts)

'You should use the 50-50 logic and append the pairs of related sentences to the actual training data.'