# Retrieval Augmented Generation (RAG)

This notebook is based on https://docs.pinecone.io/docs/gen-qa-openai, but uses Astra as a vector store in addition to Pinecone.

**_Objective_** : Demonstrate how to add domain-specific context to an OpenAI `Completion` call.

# Setup
The following blocks set up your environment for the project.

## Python Module Install
The following modules will be referenced and need to be installed:

In [97]:
import sys
!{sys.executable} -m pip install -qU \
    "openai<1" \
    cassandra-driver \
    "pinecone-client[grpc]==2.2.1" \
    datasets==2.12.0 \
    tqdm \
    ipywidgets \
    pyarrow \
    pandas \
    tiktoken

## Connect to External Services
You can either set environment variables with these values (using the upper-case form of each key, for example `OPENAI_API_KEY`), or provide a path to a `.json` file containing credentials in the following format:

```
{
"openai_api_key":"<your OpenAI key>",
"astra_securebundle_path" : "/path/to/secure/bundle.zip",
"astra_client_id" : "<your Astra CLIENT_ID>",
"astra_client_secret" : "<your Astra CLIENT_SECRET>",
"pinecone_api_key" : "<your Pinecone API key>",
"pinecone_env" : "<your Pinecone Environment>"
}
```



### Helper Function
The following code block defines a helper function to facilitate setting project variables that should be treated as secrets. For each key, it checks the environment variable first and falls back to the credentials file.

In [1]:
# Change this to a path to your credentials file
credentials_file_path = "./credentials.json"

import os
import json

def load_credentials(keys):
    credentials = {}
    if credentials_file_path:
        try:
            with open(credentials_file_path) as f:
                credentials = json.load(f)
        except Exception as e:
            print(f"Unable to load credentials file: {e}")
    
    for key in keys:
        env_key = key.upper()

        # Try to get the environment variable
        value = os.getenv(env_key)

        # If the environment variable is not set, use the credentials file
        if value is None:
            value = credentials.get(key, None)

        # If the value is still None, raise an exception
        if value is None:
            raise Exception(f"{env_key} not set")

        globals()[env_key] = value
        os.environ[env_key] = value

### Connect to OpenAI
This project uses OpenAI for example embeddings and LLM functionality. To proceed with the examples, you will need to have an account created with credits available. See https://platform.openai.com/.

In [2]:
import openai

load_credentials(['openai_api_key'])
openai.api_key = OPENAI_API_KEY

if len(openai.Engine.list()['data'])==0:
    raise Exception("OPENAI_API_KEY invalid, or otherwise unable to connect")

### Connect to Astra
To store and search in Astra, a connection must be established. You must have created a database that has Vector Search enabled. See https://astra.datastax.com/.

In [3]:
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

load_credentials(['astra_securebundle_path','astra_client_id','astra_client_secret'])

cloud_config = {'secure_connect_bundle': ASTRA_SECUREBUNDLE_PATH}
auth_provider = PlainTextAuthProvider(ASTRA_CLIENT_ID, ASTRA_CLIENT_SECRET)
cluster = Cluster(cloud=cloud_config, auth_provider=auth_provider)
session = cluster.connect()

### Connect to Pinecone
To store and search in Pinecone, you must have a valid API key. See https://app.pinecone.io/.

In [4]:
import pinecone
load_credentials(['pinecone_api_key','pinecone_env'])

pinecone.init(api_key=os.environ['PINECONE_API_KEY'], environment=os.environ['PINECONE_ENV'])
pinecone.whoami()



WhoAmIResponse(username='626febb', user_label='test-key-2', projectname='899e839')

# Completions
The OpenAI [Completions API](https://platform.openai.com/docs/api-reference/completions) generates a "completion" based on a "prompt". Following the Pinecone example, we can ask a question which would be commonly known and understood, and expect a good result. The following example uses the `text-davinci-003` model, which is an older model but is widely available and will suit our purposes. Other models are documented [here](https://platform.openai.com/docs/models/).

In [5]:
def complete(prompt):
    # query text-davinci-003
    res = openai.Completion.create(
        engine='text-davinci-003',
        prompt=prompt,
        temperature=0,
        max_tokens=400,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
        stop=None
    )
    return res['choices'][0]['text'].strip()

complete('who was the 12th person on the moon and when did they land?')


'The 12th person on the moon was Harrison Schmitt, and he landed on December 11, 1972.'

This should tell you that Harrison Schmitt landed on the moon on December 11, 1972, which is factually correct. However, the Completion model does not always get it right. Consider the question from the Pinecone example, about which it says:
> The ideal answer we'd be looking for is "Multiple Negatives Ranking (MNR) loss"

In [7]:
complete('Which training method should I use for sentence transformers when I only have pairs of related sentences?')

'If you only have pairs of related sentences, then the best training method to use for sentence transformers is the supervised learning approach. This approach involves providing the model with labeled data, such as pairs of related sentences, and then training the model to learn the relationships between the sentences. This approach is particularly useful when you have limited data, as it allows the model to learn from the data you have provided.'

The Pinecone example indicates:

> One of the common answers I get to this is:
```
The best training method to use for fine-tuning a pre-trained model with sentence transformers is the Masked Language Model (MLM) training. MLM training involves randomly masking some of the words in a sentence and then training the model to predict the masked words. This helps the model to learn the context of the sentence and better understand the relationships between words.
```

> This answer seems pretty convincing right? Yet, it's wrong. MLM is typically used in the pretraining step of a transformer model but cannot be used to fine-tune a sentence-transformer, and has nothing to do with having "pairs of related sentences".

> An alternative answer I receive is about supervised learning approach being the most suitable. This is completely true, but it's not specific and doesn't answer the question.

**Retrieval Augmented Generation** (RAG) allows us to pass additional context to the Completion engine. In outline, the process is:

1. Compute an "embedding" of the question text
2. Search for contextual information that is most similar to the question by doing a "vector search"
3. Reframe the question to the Completion engine, adding the additional context
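
The three steps above can be sketched as a single function. This is a minimal illustration rather than the notebook's implementation: `answer_with_context`, `fake_embed`, `fake_search`, and `echo_complete` are hypothetical names, and the prompt wording is an assumption. The embedding, search, and completion functions are passed in as arguments so the sketch runs without any external service.

```python
from typing import Callable, List, Sequence


def answer_with_context(
    question: str,
    embed: Callable[[str], Sequence[float]],
    search: Callable[[Sequence[float], int], List[str]],
    complete: Callable[[str], str],
    top_k: int = 2,
) -> str:
    """Run the three RAG steps using caller-supplied functions."""
    # 1. Compute an embedding of the question text
    question_vector = embed(question)
    # 2. Vector search for the most similar context passages
    contexts = search(question_vector, top_k)
    # 3. Reframe the question, prepending the retrieved context
    prompt = (
        "Answer the question based on the context below.\n\n"
        "Context:\n" + "\n---\n".join(contexts) +
        f"\n\nQuestion: {question}\nAnswer:"
    )
    return complete(prompt)


# Stand-in functions (assumptions for illustration only)
def fake_embed(text):
    return [float(len(text))]


def fake_search(vector, top_k):
    return ["passage one", "passage two", "passage three"][:top_k]


def echo_complete(prompt):
    # Returns the prompt unchanged so we can inspect what would be sent
    return prompt


rag_prompt = answer_with_context(
    "What is MNR loss?", fake_embed, fake_search, echo_complete
)
print(rag_prompt)
```

In a real pipeline, the stand-ins would be replaced by an embedding call, a vector database query, and a Completion call respectively.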

# Embeddings
An "embedding" is a mathematical reduction of complex information (in our case, textual information); we can then search these embeddings to find semantically related text. OpenAI provides a number of [embedding models](https://platform.openai.com/docs/guides/embeddings), and the current recommendation is to use `text-embedding-ada-002` which is what we will do. Let's generate an embedding and see what it looks like.

In [6]:
embed_model = "text-embedding-ada-002"

limerick_embedding = openai.Embedding.create(
    input=[
        "In the world of data, vast and sprawling,",
        "Vector embeddings are enthralling.",
        "Words as points in space, they play,",
        "Mapping meaning in unique array."
    ], engine=embed_model
)

print(limerick_embedding.keys()) # expect to see: dict_keys(['object', 'data', 'model', 'usage'])
print(len(limerick_embedding['data'])) # expect to see: 4
print(limerick_embedding['data'][0].keys()) # expect to see: dict_keys(['object', 'index', 'embedding'])


dict_keys(['object', 'data', 'model', 'usage'])
4
dict_keys(['object', 'index', 'embedding'])


The `Embedding` API has created 4 embeddings, stored within the `data` field. Each of these is a vector containing 1536 dimensions; this dimensionality is an attribute of the embedding model, and different embedding models generate vectors of different dimensions. As the `text-embedding-ada-002` model normalizes the data, individual vector dimension values should be between -1 and 1.

In [10]:
print(len(limerick_embedding['data'][0]['embedding'])) # expect to see: 1536
print(limerick_embedding['data'][0]['embedding'][0:10]) # expect to see a list of 10 float-type numbers between -1 and 1

1536
[-0.009303515776991844, -0.00019089288252871484, 0.014775779098272324, -0.01874099299311638, -0.012967320159077644, 0.027073299512267113, -0.008077782578766346, -0.0013513206504285336, -0.027368010953068733, -0.043510179966688156]
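
To see why normalization bounds each value, here is a small sketch that assumes "normalized" means the vector has an L2 norm of 1. The 3-dimensional `example_vector` is a hypothetical stand-in for a real 1536-dimension embedding.

```python
import math

# Hypothetical unit-length vector standing in for a real embedding
example_vector = [0.6, -0.8, 0.0]

# L2 norm: square root of the sum of squared components
norm = math.sqrt(sum(x * x for x in example_vector))
print(norm)

# If the squares sum to 1, no single component can exceed 1 in absolute value
print(all(-1.0 <= x <= 1.0 for x in example_vector))
```

The same norm calculation can be applied to `limerick_embedding['data'][0]['embedding']` to check the real vector.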


# Contextual Data
Following the Pinecone example again, we use a commonly-referenced dataset from Hugging Face _Datasets_. This particular dataset contains transcribed audio from several ML and tech YouTube channels.

In [11]:
from datasets import load_dataset

data = load_dataset('jamescalam/youtube-transcriptions', split='train')
data

Found cached dataset json (C:/Users/Phil/.cache/huggingface/datasets/jamescalam___json/jamescalam--youtube-transcriptions-08d889f6a5386b9b/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)


Dataset({
    features: ['title', 'published', 'url', 'video_id', 'channel_id', 'id', 'text', 'start', 'end'],
    num_rows: 208619
})

Expect something like:
```
Dataset({
    features: ['title', 'published', 'url', 'video_id', 'channel_id', 'id', 'text', 'start', 'end'],
    num_rows: 208619
})
```
This data contains a number of line-by-line transcriptions; you can see this:

In [12]:
transposed_data = [dict(zip(data[0:4], col)) for col in zip(*data[0:4].values())]
for entry in transposed_data:
    print(entry['start'],": ",entry['text'])

0.0 :  Hi, welcome to the video.
3.0 :  So this is the fourth video in a Transformers
9.36 :  from Scratch mini series.
11.56 :  So if you haven't been following along,


Expect something like:
```
0.0 :  Hi, welcome to the video.
3.0 :  So this is the fourth video in a Transformers
9.36 :  from Scratch mini series.
11.56 :  So if you haven't been following along,
```

These short phrases are helpful for video playback (nobody wants to read paragraphs of text when watching a video), but they are less useful if we want to provide additional context to our Completion engine. By merging these snippets of information within a given video, and overlapping adjacent merged snippets, we can create more substantial blocks of context.

In [13]:
from tqdm.auto import tqdm

context_data = []

window = 20  # number of phrases to combine
stride = 4  # number of phrases to 'stride' over, used to create overlap

for i in tqdm(range(0, len(data), stride)):
    i_end = min(len(data)-1, i+window)
    if data[i]['title'] != data[i_end]['title']:
        # in this case we skip this entry as we have start/end of two videos
        continue
    text = ' '.join(data[i:i_end]['text'])
    # create the new merged dataset
    context_data.append({
        'start': data[i]['start'],
        'end': data[i_end]['end'],
        'title': data[i]['title'],
        'text': text,
        'id': data[i]['id'],
        'url': data[i]['url'],
        'published': data[i]['published'],
        'channel_id': data[i]['channel_id']
    })

  0%|          | 0/52155 [00:00<?, ?it/s]

Whereas we previously had 208,619 rows, we now have only 48,688 rows, but each of these is an aggregation of up to 20 rows at a time within each video: 

In [14]:
print(len(context_data))
for i in range(5):
    print(i,": ",context_data[i]['text'])


48688
0 :  Hi, welcome to the video. So this is the fourth video in a Transformers from Scratch mini series. So if you haven't been following along, we've essentially covered what you can see on the screen. So we got some data. We built a tokenizer with it. And then we've set up our input pipeline ready to begin actually training our model, which is what we're going to cover in this video. So let's move over to the code. And we see here that we have essentially everything we've done so far. So we've built our input data, our input pipeline. And we're now at a point where we have a data loader, PyTorch data loader, ready. And we can begin training a model with it. So there are a few things to be aware of. So I mean, first, let's just have a quick look at the structure of our data.
1 :  we've essentially covered what you can see on the screen. So we got some data. We built a tokenizer with it. And then we've set up our input pipeline ready to begin actually training our model, which is w

```
0 :  Hi, welcome to the video. So this is the fourth video in a Transformers from Scratch mini series. So if you haven't been following along, we've essentially covered what you can see on the screen. So we got some data. We built a tokenizer with it. And then we've set up our input pipeline ready to begin actually training our model, which is what we're going to cover in this video. So let's move over to the code. And we see here that we have essentially everything we've done so far. So we've built our input data, our input pipeline. And we're now at a point where we have a data loader, PyTorch data loader, ready. And we can begin training a model with it. So there are a few things to be aware of. So I mean, first, let's just have a quick look at the structure of our data.
1 :  we've essentially covered what you can see on the screen. So we got some data. We built a tokenizer with it. And then we've set up our input pipeline ready to begin actually training our model, which is what we're going to cover in this video. So let's move over to the code. And we see here that we have essentially everything we've done so far. So we've built our input data, our input pipeline. And we're now at a point where we have a data loader, PyTorch data loader, ready. And we can begin training a model with it. So there are a few things to be aware of. So I mean, first, let's just have a quick look at the structure of our data. So when we're training a model for mass language modeling, we need a few tensors. We need three tensors. And this is for training Roberta, by the way, as well.
2 :  ready to begin actually training our model, which is what we're going to cover in this video. So let's move over to the code. And we see here that we have essentially everything we've done so far. So we've built our input data, our input pipeline. And we're now at a point where we have a data loader, PyTorch data loader, ready. And we can begin training a model with it. So there are a few things to be aware of. So I mean, first, let's just have a quick look at the structure of our data. So when we're training a model for mass language modeling, we need a few tensors. We need three tensors. And this is for training Roberta, by the way, as well. Same thing with Bert as well. We have our input IDs, attention mask, and our labels. Our input IDs have roughly 15% of their values masked. So we can see that here we have these two tensors.
3 :  we've done so far. So we've built our input data, our input pipeline. And we're now at a point where we have a data loader, PyTorch data loader, ready. And we can begin training a model with it. So there are a few things to be aware of. So I mean, first, let's just have a quick look at the structure of our data. So when we're training a model for mass language modeling, we need a few tensors. We need three tensors. And this is for training Roberta, by the way, as well. Same thing with Bert as well. We have our input IDs, attention mask, and our labels. Our input IDs have roughly 15% of their values masked. So we can see that here we have these two tensors. These are the labels. And we have the real tokens in here, the token IDs. And then in our input IDs tensor, we have these being replaced with mask tokens,
4 :  And we can begin training a model with it. So there are a few things to be aware of. So I mean, first, let's just have a quick look at the structure of our data. So when we're training a model for mass language modeling, we need a few tensors. We need three tensors. And this is for training Roberta, by the way, as well. Same thing with Bert as well. We have our input IDs, attention mask, and our labels. Our input IDs have roughly 15% of their values masked. So we can see that here we have these two tensors. These are the labels. And we have the real tokens in here, the token IDs. And then in our input IDs tensor, we have these being replaced with mask tokens, the number fours. So that's the structure of our input data. We've created a Torch data set from it and use that to create a Torch data loader.
```

If you look at the text, you can see the overlap. On entry `0`, we see the first 4 phrases that we printed above, ending in "So if you haven't been following along,", followed by "we've essentially covered...". This "we've essentially covered..." is the beginning of entry `1`.
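
The overlap is easier to see on a toy input. The sketch below applies the same window-and-stride idea to a hypothetical list of eight short phrases, using a smaller window (4) and stride (2) than the notebook; it omits the check that skips chunks spanning two videos.

```python
# Hypothetical phrases standing in for transcript rows
phrases = ["p0", "p1", "p2", "p3", "p4", "p5", "p6", "p7"]
toy_window = 4  # number of phrases to combine
toy_stride = 2  # step between chunk starts; smaller than the window, so chunks overlap

chunks = [
    " ".join(phrases[i:i + toy_window])
    for i in range(0, len(phrases), toy_stride)
]
for n, chunk in enumerate(chunks):
    print(n, ":", chunk)
```

Each chunk starts `toy_stride` phrases after the previous one, so consecutive chunks share `toy_window - toy_stride` phrases.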

# Contextual Embeddings
We can compute embeddings for each of our `context_data` entries by making an OpenAI `Embedding.create` call as above; indeed this is how you would normally expect to do this! Because we want to re-use these embeddings to write to both Pinecone and Astra, we will save them in a `.parquet` file and reference them later. 

You can download pre-generated embeddings, and this is advised to save you time and money. However, you should review the Generate Embeddings subsection to understand how this `.parquet` file has been created, and how embeddings can be efficiently generated.


## Download Pre-Generated Embeddings

To save yourself time and money, embeddings have been generated and saved in a `.zip` file available on Google Drive. 

In [161]:
parquet_filename = "./youtube_transcriptions_with_embeddings.parquet"

# Imports
import sys
!{sys.executable} -m pip install -qU gdown

import zipfile
import os
import shutil
from urllib.parse import urlparse
import gdown

# Download the pre-generated file from Google Drive
url = 'https://drive.google.com/uc?export=download&id=1EyEqTGHDmv6yWB3mOEW2Q4x-Eo5c_Auo'
zip_filename = parquet_filename+".zip"
gdown.download(url, zip_filename, quiet=False)

# Use the 'zipfile' module to extract the zip file
with zipfile.ZipFile(zip_filename, 'r') as zip_ref:
    zip_ref.extractall("./")
    # Record the name of the first extracted file while the archive is open
    extracted_filename = zip_ref.namelist()[0]

# Rename the unzipped file
shutil.move(extracted_filename, parquet_filename)

# Remove the zip file
os.remove(zip_filename)

Downloading...
From (uriginal): https://drive.google.com/uc?export=download&id=1EyEqTGHDmv6yWB3mOEW2Q4x-Eo5c_Auo
From (redirected): https://drive.google.com/uc?export=download&id=1EyEqTGHDmv6yWB3mOEW2Q4x-Eo5c_Auo&confirm=t&uuid=7b130595-5d95-442f-bf0c-ac95f1a95d7f
To: c:\Users\Phil\Documents\GenAI\youtube_transcriptions_with_embeddings.parquet.zip
100%|██████████| 496M/496M [00:08<00:00, 61.0MB/s] 


## Generate Embeddings 
To generate the embeddings from `context_data` and store them in a `.parquet` file, we can run the following code.

**Note:**
1. We call the OpenAI API with a `try/except` block; while we have validated this API works, we could see `RateLimitError` and want to handle that, however crudely;
2. We batch the OpenAI requests; the API is highly parallelised, and the time to compute 100 embeddings is not significantly different from the time to compute 1.

This code will take roughly 15 minutes to complete, and when using the `text-embedding-ada-002` model at June 2023 OpenAI prices, it will cost a bit under US$2.00.

In [152]:
parquet_filename = "./youtube_transcriptions_with_embeddings.parquet"

import pyarrow.parquet as pq
import pyarrow as pa
import pandas as pd
import time

def get_embeddings(text_list, embed_model):
    # try/except handles RateLimitError
    done = False
    embedding_list = None
    while not done:
        try:
            response = openai.Embedding.create(input=text_list, engine=embed_model)
            embedding_list = [data['embedding'] for data in response['data']]
            done = True
        except Exception as e:
            print(f"Exception occurred: {e}. Retrying in 5 seconds...")
            time.sleep(5)
    return embedding_list

batch_size=100
context_data_batches = [context_data[i:i+batch_size] for i in range(0, len(context_data), batch_size)]
writer = None

for batch in tqdm(context_data_batches):
    text_list = [batch_entry['text'] for batch_entry in batch]
    embeddings = get_embeddings(text_list, embed_model)
    
    for i, batch_entry in enumerate(batch):
        batch_entry['embedding'] = embeddings[i]
        df = pd.DataFrame([batch_entry])
        table = pa.Table.from_pandas(df)
        
        if writer is None:
            writer = pq.ParquetWriter(parquet_filename, table.schema)

        writer.write_table(table)

if writer is not None:
    writer.close()

  0%|          | 0/487 [00:00<?, ?it/s]
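
The retry loop in `get_embeddings` waits a fixed 5 seconds and retries indefinitely. One possible refinement, sketched here on the assumption that a bounded number of attempts is acceptable, is exponential backoff. `call_with_backoff` and `flaky` are hypothetical names, and the delay values are illustrative.

```python
import random
import time


def call_with_backoff(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**attempt (plus jitter), then retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception as exc:  # in practice, catch only the rate-limit error type
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay:.1f}s")
            sleep(delay)


# Demonstration with a stand-in function that fails twice before succeeding
attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"


# Pass a no-op sleep so the demonstration finishes immediately
result = call_with_backoff(flaky, sleep=lambda seconds: None)
print(result)
```

To use this with the embedding call, the request could be wrapped in a zero-argument function (for example a `lambda`) and passed as `func`.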

## Reviewing Contextual Embeddings
These embeddings are in a ≈1 GB `.parquet` file. We can iterate over this file by reading it in batches (saving memory), and show the first 5 entries as a Pandas dataframe. Note the computed embedding is a vector, as it was in the previous examples.

In [15]:
import pyarrow.parquet as pq
import pyarrow as pa

parquet_filename = "./youtube_transcriptions_with_embeddings.parquet"

pfile = pq.ParquetFile(parquet_filename)

rows = next(pfile.iter_batches(batch_size = 5)) 
df = pa.Table.from_batches([rows]).to_pandas() 
display(df)

pfile.close()

Unnamed: 0,start,end,title,text,id,url,published,channel_id,embedding
0,0.0,74.12,Training and Testing an Italian BERT - Transfo...,"Hi, welcome to the video. So this is the fourt...",35Pdoyi6ZoQ-t0.0,https://youtu.be/35Pdoyi6ZoQ,2021-07-06 13:00:03 UTC,UCv83tO5cePwHMt1952IVVHw,"[-0.010392856784164906, -0.01836877129971981, ..."
1,15.84,88.76,Training and Testing an Italian BERT - Transfo...,we've essentially covered what you can see on ...,35Pdoyi6ZoQ-t15.84,https://youtu.be/35Pdoyi6ZoQ,2021-07-06 13:00:03 UTC,UCv83tO5cePwHMt1952IVVHw,"[-0.008712108246982098, -0.004878648091107607,..."
2,25.76,102.56,Training and Testing an Italian BERT - Transfo...,"ready to begin actually training our model, wh...",35Pdoyi6ZoQ-t25.76,https://youtu.be/35Pdoyi6ZoQ,2021-07-06 13:00:03 UTC,UCv83tO5cePwHMt1952IVVHw,"[-0.012133925221860409, -0.009328163228929043,..."
3,39.56,114.44,Training and Testing an Italian BERT - Transfo...,we've done so far. So we've built our input da...,35Pdoyi6ZoQ-t39.56,https://youtu.be/35Pdoyi6ZoQ,2021-07-06 13:00:03 UTC,UCv83tO5cePwHMt1952IVVHw,"[-0.016255807131528854, -0.006291277706623077,..."
4,54.04,127.76,Training and Testing an Italian BERT - Transfo...,And we can begin training a model with it. So ...,35Pdoyi6ZoQ-t54.040000000000006,https://youtu.be/35Pdoyi6ZoQ,2021-07-06 13:00:03 UTC,UCv83tO5cePwHMt1952IVVHw,"[-0.017424602061510086, 0.0046691871248185635,..."


As `display` shows truncated values, we can satisfy ourselves of the complete contextual text:

In [16]:
print('Text:', df.loc[0, 'text'])
print('Embedding:', str(df.loc[0, 'embedding']).replace('\n', ''))

Text: Hi, welcome to the video. So this is the fourth video in a Transformers from Scratch mini series. So if you haven't been following along, we've essentially covered what you can see on the screen. So we got some data. We built a tokenizer with it. And then we've set up our input pipeline ready to begin actually training our model, which is what we're going to cover in this video. So let's move over to the code. And we see here that we have essentially everything we've done so far. So we've built our input data, our input pipeline. And we're now at a point where we have a data loader, PyTorch data loader, ready. And we can begin training a model with it. So there are a few things to be aware of. So I mean, first, let's just have a quick look at the structure of our data.
Embedding: [-0.01039286 -0.01836877 -0.00408195 ...  0.00097853 -0.03348809  0.00288187]


## Freeing Resources
As we now have a Parquet file with the relevant information, we can release some memory resources before proceeding.

In [17]:
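for var_name in ['data','context_data']:
    # Remove the named variable from the notebook's global namespace, if present
    # (`del var_name` would only delete the loop variable itself)
    globals().pop(var_name, None)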

# Storing Vector Embeddings
Recall our objective here: we want to take an input question, enhance that with semantically-related domain-specific context, and send that combination to the Completion model. To achieve this, we are going to need to:
1. Compute an embedding for each of our `context_data` entries;
2. Save these into a vector database (this is our "long-term memory");
3. Have a means of querying that vector database for close matches to our question.

We need to know the number of dimensions of the model, which we'll demonstrate computationally rather than hard-coding. We will then load the Parquet file into a Pandas dataframe.

After this has been loaded, we are going to create an "index" in Pinecone, and a "table" in Astra. 

In [18]:
parquet_filename = "./youtube_transcriptions_with_embeddings.parquet"

import pyarrow.parquet as pq
import pyarrow as pa

dimension_count = len(limerick_embedding['data'][0]['embedding']) # 1536 for text-embedding-ada-002
print("dimension_count = ",dimension_count)

pfile = pq.ParquetFile(parquet_filename)
df = pfile.read().to_pandas()

dimension_count =  1536


## Pinecone

### Creating a Pinecone Index
Creating an index in Pinecone can take some time (1-2 minutes), so be a little patient here.

In [14]:
pinecone_index_name = "rag-qa"

if pinecone_index_name not in pinecone.list_indexes():
    # if does not exist, create index
    pinecone.create_index(
        pinecone_index_name,
        dimension=dimension_count,
        metric='cosine',
        metadata_config={'indexed': ['channel_id', 'published']}
    )
# connect to index
pinecone_index = pinecone.GRPCIndex(pinecone_index_name)
# view index stats
print(pinecone_index.describe_index_stats())

{'dimension': 1536,
 'index_fullness': 0.4,
 'namespaces': {'': {'vector_count': 48688}},
 'total_vector_count': 48688}


### Loading the Pinecone Index
Load the data into Pinecone 100 records at a time. This will take ≈10 minutes to complete.

In [196]:
from tqdm import tqdm

# Define the batch size for Pinecone
pinecone_batch_size = 100

# Create a list to hold the items to upsert to Pinecone
to_upsert = []

# Iterate over the DataFrame
for idx, row in tqdm(df.iterrows(), total=df.shape[0]):
    # Get metadata and embedding
    meta = {
        'start': row['start'],
        'end': row['end'],
        'title': row['title'],
        'text': row['text'],
        'url': row['url'],
        'published': row['published'],
        'channel_id': row['channel_id']
    }
    embed = row['embedding']

    # Add to the list of items to upsert
    to_upsert.append((row['id'], embed, meta))

    # Upsert to Pinecone when we've reached the batch size
    # (checking the list length avoids relying on the DataFrame index values)
    if len(to_upsert) >= pinecone_batch_size:
        pinecone_index.upsert(vectors=to_upsert)
        to_upsert = []  # Reset the list

# Upsert any remaining items
if to_upsert:
    pinecone_index.upsert(vectors=to_upsert)


100%|██████████| 48688/48688 [10:01<00:00, 80.93it/s]


## Astra

### Creating an Astra Table

Because we know that the `text-embedding-ada-002` model produces normalized embeddings, we can use the `dot_product` similarity function, rather than the default `cosine`. This is mathematically equivalent (once normalized!) and significantly faster.

In [31]:
KEYSPACE_NAME = 'vsearch'
TABLE_NAME = 'youtube_transcriptions'

session.execute(f"CREATE TABLE IF NOT EXISTS {KEYSPACE_NAME}.{TABLE_NAME} (id text PRIMARY KEY, start float, end float, title text, url text, published text, channel_id text, transcript_text text, embedding VECTOR<FLOAT, {dimension_count}>)")
session.execute(f"CREATE CUSTOM INDEX IF NOT EXISTS youtube_transcriptions_ann ON {KEYSPACE_NAME}.{TABLE_NAME} (embedding) USING 'StorageAttachedIndex' WITH OPTIONS = {{ 'similarity_function': 'dot_product' }}")

<cassandra.cluster.ResultSet at 0x1d1ed59b010>
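
The equivalence of dot product and cosine similarity for normalized vectors can be checked with a few lines of plain Python. This is an illustrative sketch using hypothetical 2-dimensional vectors; the helper names `dot`, `cosine`, and `normalize` are not part of any library used in this notebook.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    # Cosine similarity divides the dot product by both vector lengths
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def normalize(vec):
    length = math.sqrt(dot(vec, vec))
    return [x / length for x in vec]


u = normalize([3.0, 4.0])
v = normalize([1.0, 2.0])

# Both lengths are 1 after normalization, so the division changes nothing
print(dot(u, v), cosine(u, v))
```

Skipping the two length calculations per comparison is where the saving from `dot_product` comes from.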

### Loading the Astra Table
Whereas Pinecone provides a "bulk" interface, Astra (on its own) does not. However, as it is a serverless Cassandra database it is able to absorb a lot of concurrent writes. The following should take about 5 minutes to complete.

In [82]:
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement
from tqdm.auto import tqdm
import threading
from concurrent.futures import ThreadPoolExecutor

class DB:
    def __init__(self, cluster: Cluster):
        self.session = cluster.connect()

    def upsert_one(self, row):
        query = SimpleStatement(
            f"""
            INSERT INTO {KEYSPACE_NAME}.{TABLE_NAME}
            (id, start, end, title, transcript_text, url, published, channel_id, embedding)
            VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s)
            """
        )
        self.session.execute(
            query, (
                row['id'],
                row['start'],
                row['end'],
                row['title'],
                row['text'],
                row['url'],
                row['published'],
                row['channel_id'],
                row['embedding']
            )
        )

thread_local_storage = threading.local()

def get_db():
    if not hasattr(thread_local_storage, 'db_handle'):
        thread_local_storage.db_handle = DB(cluster)
    return thread_local_storage.db_handle

def upsert_row(indexed_row):
    _, row = indexed_row  # unpack tuple
    db = get_db()
    row = row.to_dict()
    row['embedding'] = row['embedding'].tolist()
    db.upsert_one(row) 
  
num_threads = 64
with ThreadPoolExecutor(max_workers=num_threads) as executor:
    list(tqdm(executor.map(upsert_row, df.iterrows()), total=df.shape[0]))


  0%|          | 0/100 [00:00<?, ?it/s]

# Retrieving Text With a Vector Embedding
We now have a vector database loaded with domain-specific content (the YouTube transcriptions) along with vector embeddings that have been computed by our embedding model. Recall the question we were asking at the very beginning:

> Which training method should I use for sentence transformers when I only have pairs of related sentences?

We did not get a great answer when we asked for a Completion directly. What do we get if we look in our vector database for a semantic match? We can compute an embedding for our question, and see!

In [11]:
question = 'Which training method should I use for sentence transformers when I only have pairs of related sentences?'
question_embedding = openai.Embedding.create(
    input=[question], engine=embed_model
)
question_embedding_vector = question_embedding['data'][0]['embedding']
print(question_embedding_vector)

[-0.027706358581781387, -0.021540088579058647, 0.03334249556064606, -0.00221120729111135, -0.0021083198953419924, 0.02385592833161354, 0.01801052875816822, -0.006964954547584057, -0.007191655691713095, -0.03931345418095589, 0.012792916037142277, 0.05424084514379501, 0.0022094633895903826, 0.01788496971130371, 0.009256378747522831, 0.01817793771624565, 0.014069417491555214, 0.0035225858446210623, 0.01668519899249077, -0.021930713206529617, -0.019140545278787613, -0.0031964851077646017, -0.00904711615294218, -0.028027227148413658, -0.007868271321058273, -0.012862670235335827, 0.028096981346607208, -0.02333974651992321, -0.02343740314245224, -0.019321907311677933, 0.026673996821045876, 0.009556322358548641, -0.01841510273516178, -0.010756094008684158, 0.01121647097170353, 0.0013898519100621343, 0.015345918945968151, 0.01848485693335533, 0.025376569479703903, -0.018973136320710182, 0.01704792119562626, 0.011321102268993855, 0.006215096917003393, -0.03724873065948486, 0.007679934613406658, 

Now let us see what we get when we search our vector databases.
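
Conceptually, a vector search scores every stored vector against the query vector and returns the best matches. The sketch below does this exhaustively over a hypothetical in-memory `toy_store`; `top_k_matches` is an illustrative name. Vector databases typically avoid this full scan by using approximate nearest-neighbor indexes, so this shows the idea rather than how Pinecone or Astra work internally.

```python
def top_k_matches(query_vector, stored, k=2):
    """Return the k (score, id) pairs with the highest dot-product similarity."""
    scored = [
        (sum(q * x for q, x in zip(query_vector, vector)), item_id)
        for item_id, vector in stored.items()
    ]
    # Sort by score, highest first, and keep the first k
    return sorted(scored, reverse=True)[:k]


# Hypothetical 2-dimensional vectors keyed by made-up IDs
toy_store = {
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],
    "c": [0.6, 0.8],
}

matches = top_k_matches([1.0, 0.0], toy_store, k=2)
print(matches)
```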

## Pinecone
Using the `query` function, we can find the 2 closest matches:

In [21]:
pinecone_index_name = "rag-qa"
pinecone_index = pinecone.GRPCIndex(pinecone_index_name)

pinecone_results = pinecone_index.query(question_embedding_vector, top_k=2, include_metadata=True)
print(pinecone_results)

{'matches': [{'id': 'pNvujJ1XyeQ-t418.88',
              'metadata': {'channel_id': 'UCv83tO5cePwHMt1952IVVHw',
                           'end': 568.4,
                           'published': '2021-11-24 16:24:24 UTC',
                           'start': 418.88,
                           'text': 'pairs of related sentences you can go '
                                   'ahead and actually try training or '
                                   'fine-tuning using NLI with multiple '
                                   "negative ranking loss. If you don't have "
                                   'that fine. Another option is that you have '
                                   'a semantic textual similarity data set or '
                                   'STS and what this is is you have so you '
                                   'have sentence A here, sentence B here and '
                                   'then you have a score from from 0 to 1 '
                                   'th

You should see a `dict` with a `matches` list containing two elements. Each match carries a `score`, a Pinecone-derived similarity value documented in the [query API](https://docs.pinecone.io/reference/query).

If you look at `metadata.text`, you will see the transcript text that we combined above:

```
{'matches': [{'id': 'pNvujJ1XyeQ-t418.88',
              'metadata': {'channel_id': 'UCv83tO5cePwHMt1952IVVHw',
                           'end': 568.4,
                           'published': '2021-11-24 16:24:24 UTC',
                           'start': 418.88,
                           'text': 'pairs of related sentences you can go '
                                   'ahead and actually try training or '
                                   'fine-tuning using NLI with ...',
                           'title': 'Today Unsupervised Sentence Transformers, '
                                    'Tomorrow Skynet (how TSDAE works)',
                           'url': 'https://youtu.be/pNvujJ1XyeQ'},
              'score': 0.865344,
              'sparse_values': {'indices': [], 'values': []},
              'values': []},
             {'id': 'WS1uVMGhlWQ-t737.28',
              'metadata': {'channel_id': 'UCv83tO5cePwHMt1952IVVHw',
                           'end': 900.72,
                           'published': '2021-10-20 17:06:20 UTC',
                           'start': 737.28,
                           'text': "were actually more accurate. So we can't "
                                   "really do that. We can't use this what is "
                                   'called a mean pooling approach ...',
                           'title': 'Intro to Sentence Embeddings with '
                                    'Transformers',
                           'url': 'https://youtu.be/WS1uVMGhlWQ'},
              'score': 0.8586723,
              'sparse_values': {'indices': [], 'values': []},
              'values': []}],
 'namespace': ''}
```
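Besides the transcript text, the metadata carries enough information to cite where an answer came from. The following is a small sketch that turns a response of the shape shown above into one summary line per match. `summarize_matches` is a hypothetical helper, and `sample_results` is a hand-trimmed plain `dict` copy of the output above rather than a live Pinecone response.

```python
# A trimmed copy of the response shown above (metadata shortened by hand)
sample_results = {
    "matches": [
        {
            "id": "pNvujJ1XyeQ-t418.88",
            "score": 0.865344,
            "metadata": {
                "title": "Today Unsupervised Sentence Transformers, Tomorrow Skynet (how TSDAE works)",
                "url": "https://youtu.be/pNvujJ1XyeQ",
                "start": 418.88,
            },
        },
        {
            "id": "WS1uVMGhlWQ-t737.28",
            "score": 0.8586723,
            "metadata": {
                "title": "Intro to Sentence Embeddings with Transformers",
                "url": "https://youtu.be/WS1uVMGhlWQ",
                "start": 737.28,
            },
        },
    ],
    "namespace": "",
}

def summarize_matches(results: dict) -> list[str]:
    """Return one 'score  title (url @ start)' line per match (hypothetical helper)."""
    lines = []
    for match in results["matches"]:
        meta = match["metadata"]
        lines.append(f"{match['score']:.4f}  {meta['title']} ({meta['url']} @ {meta['start']}s)")
    return lines

for line in summarize_matches(sample_results):
    print(line)
```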

Let's finally define a helper function that will return a string of combined `text` for a given embedding.

In [7]:
def pinecone_context(vector_embedding: list[float], top_k: int):
    pinecone_results = pinecone_index.query(vector_embedding, top_k=top_k, include_metadata=True)
    return " ".join(result['metadata']['text'] for result in pinecone_results['matches'])

## Astra
As Astra is a Cassandra database with a vector-enabled SAI (Storage-Attached Index), we can search using the `ANN OF` clause in `ORDER BY` and provide `question_embedding_vector` as a parameter. In this query, we also compute a scalar using `similarity_cosine` and alias it as `distance`. Despite that alias, the value is a similarity rather than a distance: the closer to `1`, the more similar the two vectors.

In [17]:
KEYSPACE_NAME = 'vsearch'
TABLE_NAME = 'youtube_transcriptions'

from cassandra.query import SimpleStatement

cql_query = SimpleStatement(
    f"SELECT id, similarity_cosine(embedding, %s) AS distance, transcript_text FROM {KEYSPACE_NAME}.{TABLE_NAME} ORDER BY embedding ANN OF %s LIMIT %s"
)
res = session.execute(cql_query, (question_embedding_vector, question_embedding_vector, 2))
rows = list(res)
for row in rows:
    print(row.id, row.distance, row.transcript_text)


pNvujJ1XyeQ-t418.88 0.9326715469360352 pairs of related sentences you can go ahead and actually try training or fine-tuning using NLI with multiple negative ranking loss. If you don't have that fine. Another option is that you have a semantic textual similarity data set or STS and what this is is you have so you have sentence A here, sentence B here and then you have a score from from 0 to 1 that tells you the similarity between those two scores and you would train this using something like cosine similarity loss. Now if that's not an option and your focus or use case is on building a sentence transformer for another language where there is no current sentence transformer you can use multilingual parallel data. So what I mean by that is so parallel data just means translation pairs so if you have for example a English sentence and then you have another language here so it can it can be anything I'm just going to put XX and that XX is your target language you can fine-tune a model using

You should have results similar to:

```
pNvujJ1XyeQ-t418.88 0.9326715469360352 pairs of related sentences you can go ahead and actually try training or fine-tuning using NLI with ...
WS1uVMGhlWQ-t737.28 0.9293355941772461 were actually more accurate. So we can't really do that. We can't use this what is called a mean pooling ...
```

If you compare the IDs with those in the Pinecone results, you will see that both stores returned the same two transcripts. While this is comforting, there is no guarantee that the two stores will always agree!
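The IDs agree, but the numeric scores do not: Pinecone reported about `0.865` where Astra reported about `0.933`. The sketch below checks one possible explanation, namely that Astra rescales the cosine from `[-1, 1]` onto `[0, 1]` as `(1 + cosine) / 2`. This mapping is an assumption inferred from the two outputs above, and it also assumes the Pinecone index was created with the cosine metric; `rescale_cosine` and `observed` are illustrative names.

```python
import math

# (id, Pinecone score, Astra similarity_cosine) as printed in the outputs above
observed = [
    ("pNvujJ1XyeQ-t418.88", 0.865344, 0.9326715469360352),
    ("WS1uVMGhlWQ-t737.28", 0.8586723, 0.9293355941772461),
]

def rescale_cosine(cosine: float) -> float:
    """Map a cosine in [-1, 1] onto [0, 1] (assumed mapping, inferred from the data)."""
    return (1.0 + cosine) / 2.0

for doc_id, pinecone_score, astra_score in observed:
    rescaled = rescale_cosine(pinecone_score)
    agrees = math.isclose(rescaled, astra_score, abs_tol=1e-5)
    print(doc_id, round(rescaled, 6), round(astra_score, 6), agrees)
```

If the mapping holds, the two stores rank results identically for a given query, and only the scale of the reported number differs.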


Let's finally define a helper function that will return a string of combined `transcript_text` for a given embedding.

In [8]:
def astra_context(vector_embedding: list[float], top_k: int):
    cql_query = SimpleStatement(
        f"SELECT id, similarity_cosine(embedding, %s) AS distance, transcript_text FROM {KEYSPACE_NAME}.{TABLE_NAME} ORDER BY embedding ANN OF %s LIMIT %s"
    )
    res = session.execute(cql_query, (vector_embedding, vector_embedding, top_k))
    return " ".join(row.transcript_text for row in res)

# Completion, With Context
We now have all the pieces to complete Retrieval-Augmented Generation. We're basically going to ask the Completion:

> Answer the question based on the context below.
>
> Context: {Text retrieved from vector database}
>
> Question: {Original question}
>
> Answer: 

We need to be mindful of the total number of tokens (sub-word units that cover words, word fragments, and punctuation) the Completion model supports. For example, `text-davinci-003` supports a maximum of 4097 tokens. OpenAI provides an open-source token counter, `tiktoken`, that we can use to make sure we do not go over the limit; its usage is documented in a [cookbook](https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb).

In [9]:
import tiktoken
token_encoder = tiktoken.encoding_for_model("text-davinci-003")
token_encoding = token_encoder.encode("hello, world!")
print("token list: ",token_encoding)
print("decoded token list: ",token_encoder.decode(token_encoding))
print("decoded token list that has been truncated: ", token_encoder.decode(token_encoding[:2]))

token list:  [31373, 11, 995, 0]
decoded token list:  hello, world!
decoded token list that has been truncated:  hello,


With the given model `text-davinci-003`, we expect to see four tokens:
```
[31373, 11, 995, 0]
```

We can also see that the decoder returns our original string, and that we can truncate the token list (`[:2]`) to drop tokens and still decode meaningful text.
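The encode, slice, decode pattern can be wrapped in a small helper. This is a sketch only: `truncate_to_token_budget` is a hypothetical function, and it takes the encoder and decoder as arguments so that it runs here with a stand-in word-level tokenizer. In this notebook you would pass `token_encoder.encode` and `token_encoder.decode` instead.

```python
from typing import Callable

def truncate_to_token_budget(
    text: str,
    encode: Callable[[str], list],
    decode: Callable[[list], str],
    max_tokens: int,
) -> str:
    """Encode text, keep at most max_tokens tokens, and decode the kept tokens."""
    if max_tokens < 0:
        raise ValueError("max_tokens must be non-negative")
    tokens = encode(text)
    if len(tokens) <= max_tokens:
        return text  # already within budget, so skip the decode round trip
    return decode(tokens[:max_tokens])

# Stand-in tokenizer so the sketch runs without tiktoken: one token per word.
def toy_encode(text: str) -> list:
    return text.split(" ")

def toy_decode(tokens: list) -> str:
    return " ".join(tokens)

truncated = truncate_to_token_budget(
    "pairs of related sentences you can go", toy_encode, toy_decode, 4
)
print(truncated)
```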

We will now create a template question, based on the above, and compute the number of tokens:

In [12]:
def with_context(question: str, context: str):
    return f"""Answer the question based on the context below.
Context: {context}
Question: {question}
Answer: """

question_with_context = with_context(question, "")
question_with_context_tokens = token_encoder.encode(question_with_context)
print(len(question_with_context_tokens))

39


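Since the template with our question and an empty context uses `39` tokens, at most `4097 - 39 = 4058` tokens remain for context. Note that the model's limit covers the prompt and the generated answer together, so any tokens reserved for the completion also come out of this budget.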
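The budget arithmetic can be written as a small function. This is a sketch under stated assumptions: `context_token_budget` is a hypothetical helper, and the `400`-token answer reserve in the second call is an illustrative value, not one taken from this notebook.

```python
def context_token_budget(model_limit: int, template_tokens: int, answer_tokens: int = 0) -> int:
    """Tokens left for retrieved context after the template and a reserved answer."""
    budget = model_limit - template_tokens - answer_tokens
    if budget < 0:
        raise ValueError("template and reserved answer exceed the model limit")
    return budget

# The figure used in this notebook: no tokens reserved for the answer
print(context_token_budget(4097, 39))

# Same limit, with a hypothetical 400 tokens held back for the answer
print(context_token_budget(4097, 39, 400))
```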

## Pinecone
Putting it all together, we:
1. Get the added context from Pinecone with a vector search, using the vector embedding of our question;
2. Trim this context to `4058` tokens;
3. Call the Completion API with our original question and the additional context within our prompt template;
4. Print the original question and answer.

In [15]:
context = pinecone_context(question_embedding_vector, 3)
context = token_encoder.decode(token_encoder.encode(context)[:4058])
answer = complete(with_context(question,context))

print(f"""Question: {question}
Answer: {answer}""")

Question: Which training method should I use for sentence transformers when I only have pairs of related sentences?
Answer: NLI with multiple negative ranking loss.


The expected answer given in the Pinecone example is:

> You should use Natural Language Inference (NLI) with multiple negative ranking loss.

And the answer given should be similar to this.

## Astra
Putting it all together, we:
1. Get the added context from Astra with a vector search, using the vector embedding of our question;
2. Trim this context to `4058` tokens;
3. Call the Completion API with our original question and the additional context within our prompt template;
4. Print the original question and answer.

In [18]:
context = astra_context(question_embedding_vector, 3)
context = token_encoder.decode(token_encoder.encode(context)[:4058])
answer = complete(with_context(question,context))

print(f"""Question: {question}
Answer: {answer}""")

Question: Which training method should I use for sentence transformers when I only have pairs of related sentences?
Answer: NLI with multiple negative ranking loss.


The expected answer given in the Pinecone example is:

> You should use Natural Language Inference (NLI) with multiple negative ranking loss.

And the answer given should be similar to this.

# All Together Now
As a final activity, let us define a function that combines all of these steps into one, and ask it to define `NLI`.

In [19]:
def get_answer(question: str, db: str | None = None, context_depth: int = 3):
    # Embed the question so it can be compared with the stored vectors
    question_embedding = openai.Embedding.create(input=[question], engine=embed_model)
    question_embedding_vector = question_embedding['data'][0]['embedding']
    # Retrieve context from the requested vector store; any other value means no context
    context = None
    match db:
        case "pinecone":
            context = pinecone_context(question_embedding_vector, context_depth)
        case "astra":
            context = astra_context(question_embedding_vector, context_depth)
    if context is None:
        print("Asking question without context")
        answer = complete(question)
    else:
        # Trim the context to the token budget computed earlier
        context = token_encoder.decode(token_encoder.encode(context)[:4058])
        print("Asking question with context")
        print(f"Context: {context}")
        answer = complete(with_context(question,context))
    return answer

print(get_answer('What is NLI?'))
print()
print(get_answer('What is NLI?', 'astra')) # you may wish to try 'pinecone' as well

Asking question without context
NLI stands for Natural Language Inference. It is a task in natural language processing (NLP) that involves determining whether a given premise entails a given hypothesis.

Asking question with context
Context: So I fine tune on classification and NLI together and then I look at the performance on NLI on the on the test sets of these tasks, right. And my question is, is that more or less than if I were to just fine tune on NLI, which is this diagonal entry right here, I fine tune on NLI and NLI given like the same compute budget, select one of these two numbers, one, it's sort of the data equivalent to both tasks and one is the compute equivalent. You can choose but given that this cell is green right here, I think the authors have taken have chosen the top number to be sort of the more representative one. In this case, you see that training, switch, sorry, co training classification tasks and NLI tasks benefits NLI task evaluation, compared to just fine 

Note the difference between asking with and without context. Although "NLI" may be a widely used acronym that the model already knows, the context retrieved from our domain-specific YouTube transcripts gives the answer additional detail.
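Two properties of `get_answer` are worth knowing: the `match` statement requires Python 3.10 or later, and an unrecognized `db` value (for example a typo such as `'Astra'`) silently falls back to asking without context. The sketch below shows one alternative, a dictionary dispatch that raises on unknown names. `retrieve_context`, `RETRIEVERS`, and the two `fake_*` functions are hypothetical stand-ins for `pinecone_context` and `astra_context`, so the block runs without any external service.

```python
from typing import Callable, Optional

# Stub retrievers standing in for pinecone_context and astra_context (hypothetical)
def fake_pinecone_context(vector: list[float], top_k: int) -> str:
    return f"pinecone context (top_k={top_k})"

def fake_astra_context(vector: list[float], top_k: int) -> str:
    return f"astra context (top_k={top_k})"

RETRIEVERS: dict[str, Callable[[list[float], int], str]] = {
    "pinecone": fake_pinecone_context,
    "astra": fake_astra_context,
}

def retrieve_context(vector: list[float], db: Optional[str], top_k: int = 3) -> Optional[str]:
    """Return context from the named store, or None when no store is requested."""
    if db is None:
        return None
    try:
        retriever = RETRIEVERS[db]
    except KeyError:
        # Fail loudly instead of quietly answering without context
        raise ValueError(f"unknown vector store: {db!r}") from None
    return retriever(vector, top_k)

print(retrieve_context([0.1, 0.2], "astra", top_k=2))
print(retrieve_context([0.1, 0.2], None))
```

Whether to raise or to fall back is a design choice; raising makes a mistyped store name visible immediately.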

## Versus ChatGPT
As one last demonstration of how the APIs differ, let us compare our `get_answer` with the ubiquitous ChatGPT.

In [23]:
print(get_answer('Can you define NLI, in the context of ML?'))
print(get_answer('Can you define NLI, in the context of ML?', 'astra'))

Asking question without context
Asking question with context
Context: So I fine tune on classification and NLI together and then I look at the performance on NLI on the on the test sets of these tasks, right. And my question is, is that more or less than if I were to just fine tune on NLI, which is this diagonal entry right here, I fine tune on NLI and NLI given like the same compute budget, select one of these two numbers, one, it's sort of the data equivalent to both tasks and one is the compute equivalent. You can choose but given that this cell is green right here, I think the authors have taken have chosen the top number to be sort of the more representative one. In this case, you see that training, switch, sorry, co training classification tasks and NLI tasks benefits NLI task evaluation, compared to just fine tuning on NLI tasks. Okay, that's how you read these, these numbers. Or maybe I'm wrong, but that's how I think you read these numbers. So on the other hand, you look for e

'Natural Language Inference (NLI) is a type of machine learning task that requires pairs of sentences to be labeled as either contradictory, neutral, or entailing/inferring each other.'

The answer given at the time this notebook was written:

> Natural Language Inference (NLI) is a type of machine learning task that requires pairs of sentences to be labeled as either contradictory, neutral, or entailing/inferring each other.

Asking ChatGPT running model GPT-3.5 (May 24, 2023):
> In the context of machine learning (ML), NLI stands for Natural Language Inference. It is a task that involves determining the relationship between two given sentences: a premise and a hypothesis. The goal of NLI is to determine whether the hypothesis is entailed, contradicted, or has no relation to the premise. It is also known as Recognizing Textual Entailment (RTE).
>
> NLI is an important problem in natural language processing (NLP) and has various applications, including question answering, information retrieval, dialogue systems, and sentiment analysis. It helps machines understand and reason about the meaning and relationships between sentences, allowing them to comprehend and generate human-like responses.
>
> To solve NLI tasks, machine learning models are trained on labeled datasets where the relationships between premises and hypotheses are annotated. These models typically use techniques such as recurrent neural networks (RNNs), convolutional neural networks (CNNs), or transformer-based architectures like BERT or GPT. The trained models can then be used to classify new pairs of sentences into the appropriate relationship categories (entailment, contradiction, or neutral).

And model GPT-4 (May 24, 2023):
> Sure, in the context of Machine Learning (ML), NLI stands for Natural Language Inference.
> 
> Natural Language Inference is a subfield of Natural Language Processing (NLP), and it's concerned with the task of determining whether a "hypothesis" is true (entailment), false (contradiction), or undetermined (neutral) given a "premise".
> 
> In other words, it's about teaching machines to understand the relationship between a pair of sentences or pieces of text. Here's an example:
> 
> * Premise: "The dog is sleeping on the couch."
> * Hypothesis 1: "There is a sleeping dog on the furniture." (Entailment)
> * Hypothesis 2: "The dog is running around in the garden." (Contradiction)
> * Hypothesis 3: "The dog is in the house." (Neutral)
> 
> In hypothesis 1, it's entailed or directly implied by the premise. Hypothesis 2 contradicts the information given in the premise. Hypothesis 3 might be true based on the premise, but we don't have enough information to confirm, hence it's neutral.
> 
> NLI is a crucial task for many NLP applications, such as machine translation, question answering, summarization, and more. The performance on NLI tasks is often seen as a measure of a system's ability to understand language in a meaningful and nuanced way.

# Cleanup
This Notebook has allocated resources that you may wish to tidy up.

## Local Notebook

In [None]:
import os
os.remove(parquet_filename)
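`os.remove` raises `FileNotFoundError` if the file is already gone, for example when this cell is run twice. A tolerant variant is sketched below; `remove_if_present` is a hypothetical helper, and the temporary file stands in for `parquet_filename` so the block runs on its own.

```python
import tempfile
from pathlib import Path

def remove_if_present(path: str) -> bool:
    """Delete the file at path; return True if it existed, False otherwise."""
    target = Path(path)
    existed = target.exists()
    target.unlink(missing_ok=True)  # missing_ok requires Python 3.8+
    return existed

# Demonstration with a throwaway file (the notebook would pass parquet_filename)
with tempfile.NamedTemporaryFile(suffix=".parquet", delete=False) as tmp:
    demo_path = tmp.name

first_call = remove_if_present(demo_path)   # removes the file
second_call = remove_if_present(demo_path)  # nothing left to remove
print(first_call, second_call)
```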

## Pinecone

In [None]:
pinecone.delete_index(pinecone_index_name)

## Astra

In [None]:
session.execute(f"DROP TABLE {KEYSPACE_NAME}.{TABLE_NAME}")