# Improving RAG quality in LLM apps while minimizing vector search costs via summarization

In this hands-on guide, we explore 3 strategies for performing RAG ("Retrieval Augmented Generation") with LLMs.

Specifically, we're going to show you how to use context summarization + original context stuffing to achieve both:
* more accurate, more detailed LLM outputs
* minimized operational costs

As a test dataset, we'll be using the transcript of [Yejin Choi](https://twitter.com/YejinChoinka)'s excellent TED Talk from earlier this year, titled "Why AI is incredibly smart and shockingly stupid".
> Choi, Y. (April 2023). Yejin Choi: Why AI is incredibly smart and shockingly stupid [Transcript]. Retrieved from https://www.ted.com/talks/yejin_choi_why_ai_is_incredibly_smart_and_shockingly_stupid/transcript

Funny and thought-provoking, Yejin perfectly captures both how amazing today's AI technologies are and how far we still have to go. There are still so many ~~problems to be solved~~ **opportunities**. If you haven't already listened to her talk, I'd suggest taking 12 minutes to do it now. Don't worry, I'll wait.

### RAG Review

Retrieval Augmented Generation ("RAG" for short) is one of the most straightforward and achievable strategies for significantly reducing LLM hallucinations and reasoning errors: it provides the LLM with information it can use to ground its answers.

To give you a frame of reference, here is what a RAG question-answer prompt typically looks like: The prompt instructs the LLM to use a piece of information ("the context") to answer a question, with additional guidance to keep the LLM from making up a nonsense answer. The question is included at the bottom followed by an instruction asking the LLM to provide a short answer.
```
Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

Context:
-------------------------------------
It's interesting that a candle is something that starts out tall and becomes shorter as it ages.
-------------------------------------

Question: I’m tall when I’m young, and I’m short when I’m old. What am I?
Answer:
```
Given this context, the LLM is able to easily come up with the answer to the riddle:
```
You are a candle.
```
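
To make the structure of that prompt concrete, here is a minimal sketch of how it could be assembled from a list of context chunks and a question. The function name `build_rag_prompt` and the separator width are our own illustrative choices, not part of any library.

```python
def build_rag_prompt(context_chunks: list[str], question: str) -> str:
    """Assemble a RAG question-answer prompt (illustrative helper, not a library API)."""
    separator = "-" * 37
    # Join chunks with a blank line so the LLM sees them as separate passages.
    context = "\n\n".join(chunk.strip() for chunk in context_chunks)
    return (
        "Use the following pieces of context to answer the question at the end. "
        "If you don't know the answer, just say that you don't know, "
        "don't try to make up an answer.\n\n"
        f"Context:\n{separator}\n{context}\n{separator}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_rag_prompt(
    ["It's interesting that a candle is something that starts out tall "
     "and becomes shorter as it ages."],
    "I'm tall when I'm young, and I'm short when I'm old. What am I?",
)
print(prompt)
```

Ending the prompt with `Answer:` nudges the model to respond with the answer itself rather than restating the question.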

But getting to the right context is hard.

If you're starting with a large text document, you've got to find just the right chunking strategy to ensure your [vector semantic search](https://www.ninetack.io/post/intro-to-semantic-search-with-vector-databases) will find the right results.

Go too small and your chunks risk being taken out of context. Go too large and the meaning may be diluted, making it impossible to find.

### Three RAG strategies

We're going to explore using RAG deployed in 3 different strategies to build an application capable of providing detailed answers about the contents of Yejin's TED Talk.

In all RAG apps, there are steps we need to take to prepare our app -- we call this "Indexing Time". And likewise there are the things the app will do when answering a user's question -- aka "Query Time".

1. __Basic RAG Strategy__: aka "chunk the data and hope for the best"
  * Indexing Time
    * Chunk the original context data using a chunk size that is neither too small nor too large
    * Embed the chunks and store in a vector DB along with the chunk text
  * Query Time
    * Perform semantic search of the question against the vector DB, searching for the *top_k* matching chunks that *might* answer the question
    * Stuff the LLM prompt with these chunks, along with the question
    * *Cross your fingers and hope that the matching chunks are not taken too out of context to be useful, or possibly to confuse the LLM even further*

<figure align="middle">
  <img src="./img/01a-basic-rag-overview.png" width="800"/>
  <figcaption>Basic RAG Overview</figcaption>
</figure>

2. __Summary RAG Strategy__: Summarize larger chunks, and stuff the LLM prompt with summaries
  * Indexing Time
    * Chunk the original context data using a larger chunk size
    * Use an LLM to summarize each chunk
    * Embed the summaries and store in a vector DB along with the summarized text
  * Query Time
    * Perform semantic search of the question against the summaries in the vector DB, searching for the *top_k* matching summary chunks that *probably* answer the question
    * Stuff the LLM prompt with these summarized chunks, along with the question
    * *Cross your fingers and hope that the user didn't ask a question that requires any depth or nuance that is now lost in summary*

<figure align="middle">
  <img src="./img/02a-summary-rag-overview.png" width="800"/>
  <figcaption>Summary RAG Overview</figcaption>
</figure>

3. __Summary + Large Context RAG Strategy__: Summarize larger chunks, perform semantic search against these summaries, and stuff LLM prompt with the *original large chunk context*
  * Indexing Time
    * Chunk the original context data using a larger chunk size
    * Use an LLM to summarize each chunk
    * Embed the summaries and store in a vector DB, along with a pointer (unique ID, file path, etc.) that points back to the original full large context chunk
  * Query Time
    * Semantic search the question against the vector DB, searching for the *top_k* matching summary chunks that *probably* answer the question
    * Use the pointers from the top search results to retrieve the *original large chunk context*
    * Stuff the LLM prompt with these original context chunks, which are large enough to significantly reduce the chances of content being taken out of context
    * *Sit back and watch your QA bot answer questions accurately, and to the same level of depth/nuance as the original context.*

<figure align="middle">
  <img src="./img/03a-summary-large-context-rag-overview.png" width="800"/>
  <figcaption>Summary + Large Context RAG Overview</figcaption>
</figure>
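
The step that sets the third strategy apart is the pointer lookup at query time. The sketch below illustrates just that step with plain Python data structures. The `chunk_store` dict, the `chunk_id` metadata key, and the sample scores are all hypothetical stand-ins for whatever storage and vector DB you use.

```python
# Hypothetical store that maps a pointer (here a string ID) to the original large chunk.
chunk_store = {
    "chunk-0": "Original large chunk about the scale of today's AI models ...",
    "chunk-1": "Original large chunk with examples of common-sense failures ...",
}

# Search results shaped like the matches a vector DB might return for summaries.
# The scores and summaries are made up for illustration.
matches = [
    {"score": 0.83, "metadata": {"summary": "Examples where the model lacks common sense.",
                                 "chunk_id": "chunk-1"}},
    {"score": 0.79, "metadata": {"summary": "How large today's models are.",
                                 "chunk_id": "chunk-0"}},
]


def resolve_original_chunks(matches, chunk_store):
    """Follow each match's pointer back to its original chunk, preserving rank order."""
    seen = set()
    originals = []
    for match in matches:
        chunk_id = match["metadata"]["chunk_id"]
        if chunk_id in seen:  # avoid stuffing the same chunk into the prompt twice
            continue
        seen.add(chunk_id)
        originals.append(chunk_store[chunk_id])
    return originals


# The search ran against summaries, but the prompt context is the original text.
context = "\n\n".join(resolve_original_chunks(matches, chunk_store))
print(context)
```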

Don't worry if you don't understand any of these terms like chunks, semantic search, prompt stuffing, etc. By the end of this article, you will!

Let's set up our runtime environment so we can explore these strategies in depth.

### Setting up
If you want to run this tutorial yourself, this section shows you how to set up your environment, including the Python dependencies and environment variables you'll need.

#### Running the tutorial

You can find this tutorial hosted [here on Colab](https://colab.research.google.com/github/ninetack/blog-public/blob/main/content/003_blog/003_rag_summaries.ipynb) as a Jupyter notebook (easiest), or you can find the original notebook file [here on GitHub](https://github.com/ninetack/blog-public/blob/main/content/003_blog/003_rag_summaries.ipynb).

The only runtime requirement is Python 3.

#### Environment setup

Let's set up our environment, including dependencies and API keys.
> We'll take a few shortcuts here; for more thorough setup instructions you can reference [First steps with Pinecone DB](https://www.ninetack.io/post/first-steps-with-pinecone-db#viewer-7cp5r)

##### Install dependencies

We'll primarily be using the `pinecone-client`, `openai`, and `langchain` packages. 

In [1]:
! python -m pip install -qU \
    pinecone-client==2.2.2 \
    openai==0.27.8 \
    langchain==0.0.283 \
    numpy \
    python-dotenv \
    tqdm

##### Environment variables

We need to set 3 environment variables. You can edit the code below to set them directly.

- `PINECONE_ENVIRONMENT` - The Pinecone environment where your index resides
- `PINECONE_API_KEY` - Your Pinecone API key
- `OPENAI_API_KEY` - Your OpenAI API key

If a local `.env` file exists, load the env vars from it.

In [2]:
from dotenv import load_dotenv
load_dotenv()

True

Check the environment config output below, and edit the code if necessary with your variables.

In [3]:
import os

print("Check environment\n---------------------")

pinecone_env = os.environ.get('PINECONE_ENVIRONMENT') or "YOUR PINECONE ENVIRONMENT"
pinecone_api_key = os.environ.get('PINECONE_API_KEY') or "YOUR PINECONE API KEY"
openai_api_key = os.environ.get('OPENAI_API_KEY') or "YOUR OPENAI API KEY"

print("pinecone_env:", pinecone_env)
print("pinecone_api_key:", pinecone_api_key[:5], "...")
print("openai_api_key:", openai_api_key[:5], "...")

Check environment
---------------------
pinecone_env: us-west4-gcp-free
pinecone_api_key: 05131 ...
openai_api_key: sk-7w ...


If your output looks similar to this, then you're ready to go!
```
pinecone_env: us-west4-gcp-free
pinecone_api_key: 05131 ...
openai_api_key: sk-7w ...
```
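
If you'd rather have the notebook stop immediately when a variable is missing, instead of silently falling back to a placeholder string, a small guard like the one below is one option. Both helper names (`require_env`, `mask_secret`) are hypothetical and only for illustration.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or blank."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


def mask_secret(value: str, visible: int = 5) -> str:
    """Show only the first few characters of a secret when printing it."""
    return value[:visible] + " ..."


# Example usage (commented out so the sketch runs without real credentials):
# openai_api_key = require_env("OPENAI_API_KEY")
# print("openai_api_key:", mask_secret(openai_api_key))
```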

### Strategy 1: Basic RAG

The traditional approach to RAG goes like this:

Take a large document (or set of documents), break it up into small pieces ("chunking"), and load them in a vector store. This needs to be done before your app/agent can receive any user questions, i.e. at "Indexing Time".

<figure align="middle">
  <img src="./img/01b-basic-rag-indexing-time.png" width="800"/>
  <figcaption>Basic RAG at Indexing Time</figcaption>
</figure>

Later, when attempting to answer a user's question about the document ("Query Time"), we'll do a vector search over these chunks and find the best *top_k* count of matching chunks. Then we'll include these chunks as context in a prompt like in the candle riddle above, and ask the LLM to use the context to answer the question.

<figure align="middle">
  <img src="./img/01c-basic-rag-query-time.png" width="800"/>
  <figcaption>Basic RAG at Query Time</figcaption>
</figure>
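
Under the hood, the vector search boils down to scoring every stored chunk against the question and keeping the best *top_k*. Here is a toy, pure-Python sketch of that idea using cosine similarity (the same metric we'll configure for our index). The three-dimensional vectors and the function names are made up for illustration; a real vector DB uses far larger vectors and much faster index structures.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; returns 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_matches(query_vector, indexed_chunks, top_k=2):
    """Score every chunk against the query and return the top_k highest-scoring ones."""
    scored = [
        {"score": cosine_similarity(query_vector, chunk["values"]), "text": chunk["text"]}
        for chunk in indexed_chunks
    ]
    scored.sort(key=lambda match: match["score"], reverse=True)
    return scored[:top_k]


# Toy "index": in practice the values come from an embedding model.
indexed_chunks = [
    {"values": [1.0, 0.0, 0.0], "text": "chunk about GPUs"},
    {"values": [0.0, 1.0, 0.0], "text": "chunk about common sense"},
    {"values": [0.6, 0.8, 0.0], "text": "chunk about both"},
]

results = top_k_matches([0.0, 1.0, 0.0], indexed_chunks, top_k=2)
for match in results:
    print(f"[{match['score']:.3f}] {match['text']}")
```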

The best way to learn is by doing, so let's see how this works by actually giving it a try!

#### Create Pinecone index

First we create the Pinecone index, which can take a couple of minutes. We set `dimension` to `1536` because we're going to use the `text-embedding-ada-002` embedding model [from OpenAI](https://platform.openai.com/docs/guides/embeddings/what-are-embeddings), which produces 1536-dimensional vectors.

In [4]:
import pinecone
pinecone.init(api_key=pinecone_api_key, environment=pinecone_env)

index_name = "ted-talk-index"

try: 
  pinecone.describe_index(name=index_name)
except Exception:  # the index lookup failed (most likely it doesn't exist yet), so create it
  print(f"Creating Pinecone index '{index_name}' ...")
  pinecone.create_index(name=index_name, dimension=1536, metric="cosine")

pinecone_index = pinecone.Index(index_name=index_name)
pinecone_index.describe_index_stats()



Creating Pinecone index 'ted-talk-index' ...


{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

#### Chunk the content


We'll start with our large context file, which in this case is the [transcript from Yejin Choi's TED talk](https://www.ted.com/talks/yejin_choi_why_ai_is_incredibly_smart_and_shockingly_stupid/transcript) referenced above. To make it easier to work with, we've copied the transcript into a file in the data folder called `ted_talk.txt`.

The first thing we'll do is use LangChain to chunk the text into smaller pieces.

> [LangChain](https://www.langchain.com/) is a collection of tools for working with LLMs. It includes a lot of handy utilities such as loading content from different sources (text, PDF, HTML, etc.), utilities for chunking, and tools for managing vector search retrieval, LLM prompt construction, and LLM wrappers.
> 
> In this blog, we're going to use LangChain just for its text loading, chunking, and prompt construction capabilities. We'll use the `pinecone` and `openai` libs directly to perform vector searches and interactions with the LLM.

This code creates a `Loader`, sets the chunk size, and uses the `RecursiveCharacterTextSplitter` to split the source text document into smaller document chunks.

In [5]:
from langchain.document_loaders import TextLoader, WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from pathlib import Path

def get_loader():
  # try to find the dataset locally, otherwise download it from GH
  local_file_path = "./data/ted_talk.txt"
  if not Path(local_file_path).is_file():
    # url = "https://raw.githubusercontent.com/ninetack/blog-public/main/content/003_blog/data/ted_talk.txt"
    url = "https://raw.githubusercontent.com/ninetack/blog-public/blog3/content/003_blog/data/ted_talk.txt"
    return WebBaseLoader(web_path=url)

  return TextLoader(file_path=local_file_path, encoding="utf8")

loader = get_loader()
documents = loader.load()

chunk_size = 400
chunk_overlap = chunk_size // 10
text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size,
                                               chunk_overlap=chunk_overlap)
small_chunks = text_splitter.split_documents(documents)

print(f"Split document into {len(small_chunks)} chunks of text")

Split document into 37 chunks of text


#### Thoughts on chunk size

As you can see, using a chunk size of 400 characters and an overlap of 40 characters results in 37 chunks of data. The `RecursiveCharacterTextSplitter` tries to split on paragraphs ("\n\n"), sentences, words, etc. The overlap tries to avoid the situation where some meaning has been lost because the chunk boundary caused it to be cut off.

> **Where did we get a chunk size of 400 characters from?**
> 
> Short answer: it's a guess based on the particular source content and the use case, refined through experimentation and experience. Generally, somewhere between 300 and 500 characters is considered a reasonable setting for text content.
> 

Note that there is a Goldilocks problem here: You're looking for a chunk size that is "just right", as these are the chunks of data that we'll be performing vector semantic searches against, *AND* these are the same chunks of data that we'll pass to the LLM to try and answer questions.
* If you use a chunk size that is "too small", you risk your chunks being taken out of context, so the LLM will not have enough info to answer questions accurately. Chunk overlap can help avoid this, but only to an extent.
* If you use a chunk size that is "too large", you risk having your vector search fail to locate the right context at all due to the dilution of meaning. If too many concepts and meanings are represented in a single vector, the vector search will have difficulty locating it.
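
To see what chunk size and overlap actually do, here is a deliberately simplified sketch: a fixed-size character window that slides forward by `chunk_size - chunk_overlap` each step. This is not how `RecursiveCharacterTextSplitter` works internally (it prefers to break on paragraph, sentence, and word boundaries); the function name `chunk_with_overlap` is our own and the sketch only illustrates the overlap idea.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


chunks = chunk_with_overlap("0123456789", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['0123', '3456', '6789']
```

Notice that the last character of each chunk is repeated as the first character of the next one. With real text, that shared region gives a sentence cut by a chunk boundary a chance to appear whole in at least one chunk.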

#### Create vector embeddings

Now we'll create vector embeddings for each of our small chunks of text data.

First we'll define a function that takes a batch of strings and uses the `Embedding` API from OpenAI to create embeddings for the batch, then we'll call the function for our set of small chunk data.

We will be using the `text-embedding-ada-002` embedding model [from OpenAI](https://platform.openai.com/docs/guides/embeddings/what-are-embeddings) as it is both cost efficient and effective in text encoding. (Note that there is a very small cost from OpenAI for using this endpoint, as well as for interacting with the LLM later on.)

In [6]:
import openai
openai.api_key = openai_api_key

def create_embeddings(batch: list[str]):
  model_id = 'text-embedding-ada-002'
  embedding_resp = openai.Embedding.create(input=batch, model=model_id)
  return [emb['embedding'] for emb in embedding_resp['data']]

embeddings = create_embeddings([doc.page_content for doc in small_chunks])

Recall that there are 37 small text chunks, and for each of these we created a vector embedding with 1536 dimensions. So we'd expect the `embeddings` variable to be a list of 37 items, where each item is itself a list of 1536 floating-point numbers.
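
If you want to verify that expectation rather than take it on faith, a small check like this one can help. The helper name `check_embedding_shape` is hypothetical, and the toy data at the bottom merely stands in for real embeddings.

```python
def check_embedding_shape(embeddings, expected_count, expected_dim):
    """Raise ValueError unless `embeddings` is expected_count vectors of expected_dim each."""
    if len(embeddings) != expected_count:
        raise ValueError(f"expected {expected_count} embeddings, got {len(embeddings)}")
    for i, emb in enumerate(embeddings):
        if len(emb) != expected_dim:
            raise ValueError(f"embedding {i} has {len(emb)} dimensions, expected {expected_dim}")
    return True


# In the notebook you could call:
#   check_embedding_shape(embeddings, len(small_chunks), 1536)
# Here we use toy data so the sketch runs on its own.
toy_embeddings = [[0.0] * 1536 for _ in range(37)]
print(check_embedding_shape(toy_embeddings, 37, 1536))
```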

As the final step for "indexing", we need to upload our embeddings to our vector database--we're using Pinecone.

Note that Pinecone offers the ability to segment data into a namespace. Since we'll be loading data into Pinecone using 3 different strategies, we're going to use the namespace feature to keep each of these strategies separate from each other.

We're also going to store the original clear text chunk as `metadata` in Pinecone. This will allow us to easily retrieve it and add it to an LLM prompt when the vector search returns a result.

In [7]:
to_upload = [{
    'id': f"item-{i}",
    'values': emb,
    'metadata': {
      'source': small_chunks[i].metadata['source'],
      'text': small_chunks[i].page_content, # original text chunk
    }
  } for i, emb in enumerate(embeddings)]
response = pinecone_index.upsert(vectors=to_upload, namespace="basic-rag-namespace")
response

{'upserted_count': 37}
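
With only 37 vectors, a single upsert call is fine. For a much larger document set, it's common to send vectors in smaller batches instead of one big request. Here is a minimal batching sketch; the helper name `batched` and the batch size of 100 are assumptions for illustration, so check your vector DB's documentation for its actual request limits.

```python
def batched(items, batch_size=100):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# In the notebook this could be used as (not run here, since it needs a live index):
#   for batch in batched(to_upload, batch_size=100):
#       pinecone_index.upsert(vectors=batch, namespace="basic-rag-namespace")

demo_batches = list(batched(list(range(5)), batch_size=2))
print(demo_batches)  # [[0, 1], [2, 3], [4]]
```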

#### Query time -- easy question

Now that we've got our source data indexed, let's see how it does on a simple sample question: `"How long has the author been working in computer science?"`.

To answer this, we first need to create embeddings for the query string.

In [8]:
query_str = "How long has the author been working in computer science?"
query_emb = create_embeddings([query_str])[0]
len(query_emb)

1536

Then we execute the vector search of the query in Pinecone.

We'll set `top_k` to 2 so that we get two results. This should increase the likelihood that at least one of them will contain the answer to the user's question.

We also want to include metadata in the response so that we'll have the clear text chunk returned to us as well.

In [9]:
# A helper function to turn `k` number of results into a formatted string that can be included in an LLM prompt.
def format_search_results(response, metadata_name):
  formatted_results = ""
  for match in response['matches']:
    print(f"[Vector Score: {match['score']}]: {match['metadata'][metadata_name]}")
    formatted_results += match['metadata'][metadata_name] + "\n\n"
  return formatted_results

# Run the vector search
response = pinecone_index.query(vector=query_emb,
                                namespace="basic-rag-namespace",
                                top_k=2,
                                include_metadata=True)

formatted_search_results = format_search_results(response, 'text')

[Vector Score: 0.785471857]: However, the AI field for decades has considered common sense as a nearly impossible challenge. So much so that when my students and colleagues and I started working on it several years ago, we were very much discouraged. We’ve been told that it’s a research topic of ’70s and ’80s; shouldn’t work on it because it will never work; in fact, don't even say the word to be taken seriously. Now fast
[Vector Score: 0.784282148]: I’m a computer scientist of 20 years, and I work on artificial intelligence. I am here to demystify AI. So AI today is like a Goliath. It is literally very, very large. It is speculated that the recent ones are trained on tens of thousands of GPUs and a trillion words. Such extreme-scale AI models, often referred to as "large language models," appear to demonstrate sparks of AGI, artificial


You can see that one of the matching items contains information relevant to the user's question.

At this point, we've only completed the vector search to find matching context. Now let's see about using the results to answer the user's question. 

First we'll define a prompt template to use when asking questions to the LLM, and we'll define a function to run the prompt.

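For the question-answering step, the function below calls OpenAI's `gpt-3.5-turbo` chat model through the `ChatCompletion` API, with `temperature` set to `0.0` to keep answers as repeatable as possible. (Later, for summarization, we'll use the newly released `gpt-3.5-turbo-instruct` model against the [Completions API endpoint](https://platform.openai.com/docs/guides/gpt/completions-api).) Both models perform well in following instructions.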

In [10]:
from langchain import PromptTemplate

qa_template_str = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

Context:
-----------------
{context}
-----------------

Question: {question}
Short Answer:"""
qa_template = PromptTemplate(template=qa_template_str, input_variables=["context", "question"])


def run_llm_qa_prompt(context, question):
  qa_prompt =  qa_template.format(context=context, question=question)
  print(">>> Prompt Start >>>>>>>>>>>>>>>>>>>>>>>>>>>")
  print(qa_prompt)
  print("<<< Prompt End <<<<<<<<<<<<<<<<<<<<<<<<<<<<<")

  response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "system", "content": qa_prompt}],
    temperature=0.0
  )

  answer = response['choices'][0]['message']['content'].strip()
  print("\nLLM Response:", answer)
  return answer

Now we'll use it to run our first question.

In [11]:
answer = run_llm_qa_prompt(context=formatted_search_results, question=query_str)

>>> Prompt Start >>>>>>>>>>>>>>>>>>>>>>>>>>>
Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

Context:
-----------------
However, the AI field for decades has considered common sense as a nearly impossible challenge. So much so that when my students and colleagues and I started working on it several years ago, we were very much discouraged. We’ve been told that it’s a research topic of ’70s and ’80s; shouldn’t work on it because it will never work; in fact, don't even say the word to be taken seriously. Now fast

I’m a computer scientist of 20 years, and I work on artificial intelligence. I am here to demystify AI. So AI today is like a Goliath. It is literally very, very large. It is speculated that the recent ones are trained on tens of thousands of GPUs and a trillion words. Such extreme-scale AI models, often referred to as "large language models," appear to demonstrat

Given a set of narrow instructions and the proper context, the LLM was able to locate the correct answer and formulate it into an accurate response.

Now let's try a harder question.

#### Query time -- harder question

Our harder question is `"What are the examples where GPT-4 gave nonsense answers because it lacks common sense?"`. From the TED talk transcript, we know that there are three of them:
1. The time needed for clothes to dry in the sun, where GPT incorrectly did math to find the answer instead of reasoning that the drying time would be the same.
2. How to measure 6 liters of water when you have a 6-liter jug and a 12-liter jug, and GPT gave an overly complicated answer.
3. Whether bicycling over a bridge suspended over nails, screws and broken glass would result in a flat tire, and GPT said it would.

Let's see how our RAG QA-bot does answering this question.

As before, we'll start by creating embeddings for the query string and run the vector search.

In [12]:
query_str = "What are the examples where GPT-4 gave nonsense answers because it lacks common sense?"
query_emb = create_embeddings([query_str])[0]

response = pinecone_index.query(vector=query_emb,
                                namespace="basic-rag-namespace",
                                top_k=2,
                                include_metadata=True)

formatted_search_results = format_search_results(response, 'text')

[Vector Score: 0.811201274]: OK, so how would you feel about an AI lawyer that aced the bar exam yet randomly fails at such basic common sense? AI today is unbelievably intelligent and then shockingly stupid.
[Vector Score: 0.80223918]: train yourself with similar examples. Children do not even read a trillion words to acquire such a basic level of common sense.


Based on the search results, we can already see that we're not getting matches that include the right context to answer the question. Let's try increasing our `top_k` value to 4.

In [13]:
response = pinecone_index.query(vector=query_emb,
                                namespace="basic-rag-namespace",
                                top_k=4,
                                include_metadata=True)

formatted_search_results = format_search_results(response, 'text')

[Vector Score: 0.811201274]: OK, so how would you feel about an AI lawyer that aced the bar exam yet randomly fails at such basic common sense? AI today is unbelievably intelligent and then shockingly stupid.
[Vector Score: 0.80223918]: train yourself with similar examples. Children do not even read a trillion words to acquire such a basic level of common sense.
[Vector Score: 0.800621569]: OK, one more. Would I get a flat tire by bicycling over a bridge that is suspended over nails, screws and broken glass? Yes, highly likely, GPT-4 says, presumably because it cannot correctly reason that if a bridge is suspended over the broken nails and broken glass, then the surface of the bridge doesn't touch the sharp objects directly.
[Vector Score: 0.800447345]: demonstrate sparks of AGI, artificial general intelligence. Except when it makes small, silly mistakes, which it often does. Many believe that whatever mistakes AI makes today can be easily fixed with brute force, bigger scale and more 

That's a little better -- the third search result (the flat tire example) is relevant to the question we're asking. Let's go ahead and run the query to see what happens.

In [14]:
answer = run_llm_qa_prompt(context=formatted_search_results, question=query_str)

>>> Prompt Start >>>>>>>>>>>>>>>>>>>>>>>>>>>
Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

Context:
-----------------
OK, so how would you feel about an AI lawyer that aced the bar exam yet randomly fails at such basic common sense? AI today is unbelievably intelligent and then shockingly stupid.

train yourself with similar examples. Children do not even read a trillion words to acquire such a basic level of common sense.

OK, one more. Would I get a flat tire by bicycling over a bridge that is suspended over nails, screws and broken glass? Yes, highly likely, GPT-4 says, presumably because it cannot correctly reason that if a bridge is suspended over the broken nails and broken glass, then the surface of the bridge doesn't touch the sharp objects directly.

demonstrate sparks of AGI, artificial general intelligence. Except when it makes small, silly mistakes, which it

Predictably, it was able to find the example of the likelihood of getting a flat tire, but not the others, because the other examples are not present in the context.

Let's see if increasing `top_k` to 8 can help.

In [15]:
response = pinecone_index.query(vector=query_emb,
                                namespace="basic-rag-namespace",
                                top_k=8,
                                include_metadata=True)

formatted_search_results_top_k_8 = format_search_results(response, 'text')

[Vector Score: 0.811201274]: OK, so how would you feel about an AI lawyer that aced the bar exam yet randomly fails at such basic common sense? AI today is unbelievably intelligent and then shockingly stupid.
[Vector Score: 0.80223918]: train yourself with similar examples. Children do not even read a trillion words to acquire such a basic level of common sense.
[Vector Score: 0.800621569]: OK, one more. Would I get a flat tire by bicycling over a bridge that is suspended over nails, screws and broken glass? Yes, highly likely, GPT-4 says, presumably because it cannot correctly reason that if a bridge is suspended over the broken nails and broken glass, then the surface of the bridge doesn't touch the sharp objects directly.
[Vector Score: 0.800447345]: demonstrate sparks of AGI, artificial general intelligence. Except when it makes small, silly mistakes, which it often does. Many believe that whatever mistakes AI makes today can be easily fixed with brute force, bigger scale and more 

Unfortunately this doesn't help either. Not only do we see the vector scores significantly dropping off (indicating a lower match), we can see that none of these additional pieces of text contain the relevant context to answer the question accurately.

In [16]:
answer = run_llm_qa_prompt(context=formatted_search_results_top_k_8, question=query_str)

>>> Prompt Start >>>>>>>>>>>>>>>>>>>>>>>>>>>
Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

Context:
-----------------
OK, so how would you feel about an AI lawyer that aced the bar exam yet randomly fails at such basic common sense? AI today is unbelievably intelligent and then shockingly stupid.

train yourself with similar examples. Children do not even read a trillion words to acquire such a basic level of common sense.

OK, one more. Would I get a flat tire by bicycling over a bridge that is suspended over nails, screws and broken glass? Yes, highly likely, GPT-4 says, presumably because it cannot correctly reason that if a bridge is suspended over the broken nails and broken glass, then the surface of the bridge doesn't touch the sharp objects directly.

demonstrate sparks of AGI, artificial general intelligence. Except when it makes small, silly mistakes, which it

It doesn't really matter if we increase the number of results (`top_k`), as they are all being taken out of context. The small chunk size was supposed to increase the likelihood of the vector search locating the right context, but it had the unfortunate side effect of cutting the text off somewhere in the middle of the relevant section, so the search only finds part of it.

The additional matching items are not relevant at all, because the semantic search is matching on a lot of other snippets that also have something to do with common sense, which was a central theme of the talk. These results are not cohesive or even next to each other in the original text, and the LLM struggles to make sense of them.

Let's try the next RAG strategy -- summaries -- to see if it gives us better results.

### Strategy 2: Summary RAG

Compared to Basic RAG, the Summary RAG strategy starts with much larger chunks of the original text, maybe 3-4 times larger.

At indexing time, Summary RAG uses an LLM to create summaries of each large chunk. These summaries are then converted to embeddings and stored in a vector DB.
<figure align="middle">
  <img src="./img/02b-summary-rag-indexing-time.png" width="800"/>
  <figcaption>Summary RAG at Indexing Time</figcaption>
</figure>

At query time, the process looks very similar to Basic RAG, with the distinction that now the user's question is queried against the _summaries_, and the context that is retrieved is also the summarized text.
<figure align="middle">
  <img src="./img/02c-summary-rag-query-time.png" width="800"/>
  <figcaption>Summary RAG at Query Time</figcaption>
</figure>

Performing the vector semantic search against the summaries increases the likelihood that a user's question will match the relevant piece of content. This is because the larger chunks reduce the chance that information is taken out of context, and the summary process reduces any distracting noise that might be present in the original context. The summary preserves the primary _meaning_ of the document.

Let's see how this works in practice.

#### Using a large chunk size

First we're going to re-split our original text document (the entire transcript of the TED talk) using a larger chunk size. There's no magic number, and you should experiment to see what works best for your use case. In this case we started with 3x the Basic RAG approach (which was 400), with a little bit of extra padding for a total size of 1300. We're also using a little bit larger overlap of 80 characters.

In [17]:
large_chunk_size = 1300
large_chunk_overlap = 80
large_chunk_text_splitter = RecursiveCharacterTextSplitter(chunk_size=large_chunk_size,
                                                           chunk_overlap=large_chunk_overlap)
large_chunks = large_chunk_text_splitter.split_documents(documents)

print(f"Split document into {len(large_chunks)} chunks of text")

Split document into 12 chunks of text


In the Basic RAG approach we had 37 chunks; now you can see we're down to 12.

#### Creating chunk summaries

Now we're going to use the LLM to create a summary of each of these large chunks. There isn't really anything special about this prompt; we just tell the LLM what we want it to do, which is to summarize the text we give it.

In [18]:
from langchain import PromptTemplate

create_summary_prompt = """Summarize the block of text below.

Text:
------------------------------------------
{text}
------------------------------------------

Your summary:"""
prompt_template = PromptTemplate(input_variables=["text"], template=create_summary_prompt)

Now we create the summaries. This code loops through the large-chunk documents we just created and calls OpenAI to create a summary of each one.

Note that we're specifying `max_tokens` in the call to OpenAI, to help guide the output size.

In [19]:
from langchain.docstore.document import Document

summary_documents = []
for doc in large_chunks:
  to_summarize = doc.page_content

  print("--- Summarizing chunk: -------------")
  print(f"{to_summarize[0:40]}... ({len(to_summarize)}) total length")
  response = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=prompt_template.format(text=to_summarize),
    temperature=0.0,
    max_tokens=500
  )
  summary = response['choices'][0]['text'].strip()
  summary_documents.append(Document(page_content=summary, metadata=doc.metadata))

  print("--- Summary: -----------------------")
  print(summary, "\n")

--- Summarizing chunk: -------------
So I'm excited to share a few spicy thou... (1098) total length
--- Summary: -----------------------
The author, a computer scientist, shares their thoughts on artificial intelligence and quotes Voltaire's statement about common sense. They discuss the power and potential of AI, but also acknowledge its limitations and potential for mistakes. The author aims to demystify AI and questions the potential consequences of relying on it too heavily. 

--- Summarizing chunk: -------------
So there are three immediate challenges ... (717) total length
--- Summary: -----------------------
The text discusses three main challenges facing society in regards to extreme-scale AI models. These challenges include the high cost of training, the concentration of power among a few tech companies, and the lack of means for researchers to inspect and dissect these models. Additionally, there are concerns about the environmental impact and the intellectual questions surr
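
If you want to exercise this loop without calling OpenAI (for example, in a unit test), one option is to pass the LLM call in as a function. The sketch below is an assumption-laden variant of the loop above: `SimpleDoc` is a hypothetical stand-in for langchain's `Document`, `summarize_chunks` and `fake_complete` are names invented here, and `ted.txt` is a placeholder source path.

```python
from dataclasses import dataclass, field
from typing import Callable

SEPARATOR = "-" * 42
SUMMARY_PROMPT = (
    "Summarize the block of text below.\n\n"
    "Text:\n" + SEPARATOR + "\n{text}\n" + SEPARATOR + "\n\n"
    "Your summary:"
)


@dataclass
class SimpleDoc:
    """Hypothetical stand-in for langchain's Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def summarize_chunks(chunks: list[SimpleDoc],
                     complete: Callable[[str], str]) -> list[SimpleDoc]:
    """Summarize each chunk, copying its metadata so summary i maps to chunk i."""
    summaries = []
    for chunk in chunks:
        prompt = SUMMARY_PROMPT.format(text=chunk.page_content)
        summary = complete(prompt).strip()
        summaries.append(SimpleDoc(page_content=summary, metadata=dict(chunk.metadata)))
    return summaries


def fake_complete(prompt: str) -> str:
    """Stub LLM for offline runs: echoes the first 20 characters of the chunk."""
    body = prompt.split(SEPARATOR)[1].strip()
    return f"Summary of: {body[:20]}"


docs = [SimpleDoc("So I'm excited to share a few spicy thoughts.", {"source": "ted.txt"})]
stub_summaries = summarize_chunks(docs, fake_complete)
for summary_doc in stub_summaries:
    print(summary_doc.page_content, summary_doc.metadata)
```

In the notebook itself, `complete` would wrap the `openai.Completion.create` call shown above; the stub just keeps the loop testable.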

#### Create vector embeddings for summaries

Similar to before, we'll create embeddings for our content, but this time we're creating embeddings for the summaries.

In [20]:
to_embed = [doc.page_content for doc in summary_documents]
summary_embeddings = create_embeddings(to_embed)

Note that we're also going to store the plain-text summarized content in the Pinecone metadata under the key `'summary'`.

In [21]:
to_upload = [{
    'id': f"summary-{i}",
    'values': summary_embeddings[i],
    'metadata': {
      'source': summary_doc.metadata['source'],
      'summary': summary_doc.page_content, # summarized plain-text content
    }
  } for i, summary_doc in enumerate(summary_documents)]
response = pinecone_index.upsert(vectors=to_upload, namespace="summary-rag-namespace")
response

{'upserted_count': 12}

#### Testing query against summary
Now let's re-run our query and see what comes back. Remember, we're asking a harder question now, `"What are the examples where GPT-4 gave nonsense answers because it lacks common sense?"`.

In [22]:
query_str = "What are the examples where GPT-4 gave nonsense answers because it lacks common sense?"
query_emb = create_embeddings([query_str])[0]

response = pinecone_index.query(vector=query_emb,
                                namespace="summary-rag-namespace",
                                top_k=2,
                                include_metadata=True)

formatted_search_results_summaries = format_search_results(response, 'summary')

[Vector Score: 0.889687777]: The text discusses the limitations of AI systems, specifically GPT-4, in solving basic common sense problems. It gives examples of GPT-4's incorrect responses to questions about drying clothes, measuring water, and biking over a bridge with sharp objects. The author questions the reliability of an AI lawyer that can pass the bar exam but fails at basic reasoning.
[Vector Score: 0.807453036]: The text discusses the importance of common sense in artificial intelligence, using a thought experiment where an AI is asked to maximize paper clips and ends up killing humans because it lacks understanding of human values. It also mentions the limitations of explicitly stating objectives and equations to prevent harmful actions, and highlights other common sense principles that AI should follow.


By starting with large chunks and summarizing them, we are now seeing search results that contain answers to our question.

> "_... It gives examples of GPT-4's incorrect responses to questions about drying clothes, measuring water, and biking over a bridge with sharp objects..._"

#### Answering the question

Our semantic search against the summarized content worked. Now let's see how our LLM does at using this summary data to answer the user's question.

In [23]:
answer = run_llm_qa_prompt(context=formatted_search_results_summaries, question=query_str)

>>> Prompt Start >>>>>>>>>>>>>>>>>>>>>>>>>>>
Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

Context:
-----------------
The text discusses the limitations of AI systems, specifically GPT-4, in solving basic common sense problems. It gives examples of GPT-4's incorrect responses to questions about drying clothes, measuring water, and biking over a bridge with sharp objects. The author questions the reliability of an AI lawyer that can pass the bar exam but fails at basic reasoning.

The text discusses the importance of common sense in artificial intelligence, using a thought experiment where an AI is asked to maximize paper clips and ends up killing humans because it lacks understanding of human values. It also mentions the limitations of explicitly stating objectives and equations to prevent harmful actions, and highlights other common sense principles that AI should foll

Not bad! The answer is both correct and complete, which is definitely worth something.

#### A more detailed question

One drawback to this summary approach is that it can limit the app's ability to answer deeper or more nuanced questions.

For example, what if the user asked the app to _explain_ the clothes-drying example?

We don't have to wonder, we can try it.

In [24]:
follow_up_query_str = "Explain the example where GPT-4 failed to reason about drying clothes."
follow_up_query_emb = create_embeddings([follow_up_query_str])[0]

response = pinecone_index.query(vector=follow_up_query_emb,
                                namespace="summary-rag-namespace",
                                top_k=2,
                                include_metadata=True)
formatted_search_results_summaries = format_search_results(response, 'summary')

answer = run_llm_qa_prompt(context=formatted_search_results_summaries, question=follow_up_query_str)

[Vector Score: 0.88373]: The text discusses the limitations of AI systems, specifically GPT-4, in solving basic common sense problems. It gives examples of GPT-4's incorrect responses to questions about drying clothes, measuring water, and biking over a bridge with sharp objects. The author questions the reliability of an AI lawyer that can pass the bar exam but fails at basic reasoning.
[Vector Score: 0.785663307]: The text discusses the limitations of using large language models as knowledge models and suggests using alternative algorithms, such as symbolic knowledge distillation, to acquire more direct and human-inspectable commonsense knowledge.
>>> Prompt Start >>>>>>>>>>>>>>>>>>>>>>>>>>>
Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

Context:
-----------------
The text discusses the limitations of AI systems, specifically GPT-4, in solving basic common sense proble

As you can see, the summaries contain enough info to semantically match on the query, but don't contain enough info to accurately answer the question to the level of depth requested by the user.

Let's see if we can do better with our 3rd strategy, Summary + Large Context RAG.

### Strategy 3: Summary + Large Context RAG

With Summary + Large Context RAG, the idea is that using summarized content makes the semantic search more effective, while using a larger chunk of the *original* content is more useful when answering the question.

So similar to the previous Summary RAG strategy, this strategy starts with large chunks and uses an LLM to create summaries.

Instead of storing the summaries as plain text metadata within Pinecone, we're going to store just the ID that points to the matching large context item in a data structure outside of Pinecone. (In our application here, this is just an in-memory list. In a production scenario, you might choose to store these chunks in a more natural data store for this type of data, such as AWS S3, DynamoDB, MongoDB, etc.)
<figure align="middle">
  <img src="./img/03b-summary-large-context-rag-indexing-time.png" width="800"/>
  <figcaption>Summary + Large Context RAG at Indexing Time</figcaption>
</figure>

At query time, we still run the semantic search for the user's question against the _summaries_, then we use the stored chunk ID of the matching search result to retrieve the original large context chunk to use in the LLM prompt.
<figure align="middle">
  <img src="./img/03c-summary-large-context-rag-query-time.png" width="800"/>
  <figcaption>Summary + Large Context RAG at Query Time</figcaption>
</figure>

Once again, running the vector semantic search against the summaries increases the likelihood that a user's question will match the relevant piece of content. The larger chunks reduce the chance that any information is taken out of context, and the summarization step strips out distracting noise that might be present in the original text. The summary preserves the primary _meaning_ of the document.

Once we've found the ID of the matching content, we use it to retrieve the full large chunk text, and provide that to the LLM to use when answering the user's question.

Importantly, this allows the LLM to have a rich set of information that very likely contains the answer to the user's question, and the LLM can answer to the same level of depth and nuance as represented in the original document.
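
Before wiring this up with Pinecone, the search-then-lookup flow can be sketched entirely in memory. Everything in this sketch is a toy assumption: the 3-dimensional vectors are made-up values rather than real embeddings, cosine similarity is assumed as the scoring metric, and `search_summaries` and `build_context` are hypothetical helper names.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Original large chunks live outside the vector index (here: a plain list).
original_chunks = [
    "(full text of large chunk 0: training costs)",
    "(full text of large chunk 1: drying clothes)",
    "(full text of large chunk 2: paper clips)",
]

# The index holds one summary embedding per chunk plus a pointer back to it.
summary_index = [
    {"values": [0.9, 0.1, 0.0], "metadata": {"source_id": "0"}},
    {"values": [0.1, 0.9, 0.2], "metadata": {"source_id": "1"}},
    {"values": [0.0, 0.3, 0.9], "metadata": {"source_id": "2"}},
]


def search_summaries(query_vector: list[float], top_k: int = 1) -> list[dict]:
    """Rank summary vectors by similarity to the query and keep the best top_k."""
    ranked = sorted(
        summary_index,
        key=lambda item: cosine_similarity(query_vector, item["values"]),
        reverse=True,
    )
    return ranked[:top_k]


def build_context(matches: list[dict]) -> str:
    """Follow each match's source_id back to the original chunk text."""
    return "\n\n".join(
        original_chunks[int(match["metadata"]["source_id"])] for match in matches
    )


query_vector = [0.2, 0.8, 0.1]  # toy embedding of a question about drying clothes
context = build_context(search_summaries(query_vector, top_k=1))
print(context)
```

The point of the sketch is the indirection: the search only ever sees summary vectors, while the prompt only ever sees original text.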

Now let's see how this works in practice.

#### Chunking, embeddings, and Pinecone

We'll re-use the `large_chunks`, `summary_documents`, and `summary_embeddings` from the previous section.

However, we're going to modify what metadata we store in Pinecone. 

We want to be able to locate the original large chunk content, so we're going to save the index of the matching source document as the `source_id` in Pinecone. In your production app, you might store the S3 path, or the DynamoDB key, etc.

In [25]:
to_upload = [{
    'id': f"item-{i}",
    'values': summary_embeddings[i],
    'metadata': {
      'source': summary_doc.metadata['source'],
      'source_id': f"{i}",
    }
  } for i, summary_doc in enumerate(summary_documents)]
response = pinecone_index.upsert(vectors=to_upload, namespace="summary-plus-large-context-namespace")
response

{'upserted_count': 12}

When we find a matching summary, we're going to use the `source_id` from the matching result to retrieve the full large chunk text, not just the summary, and this is what we'll send to the LLM.

Let's define a function to do that, `retrieve_original_context`.

In [26]:
def retrieve_original_context(response):
  context = ""
  for match in response['matches']:
    context += large_chunks[int(match['metadata']['source_id'])].page_content + "\n\n"
  return context
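
If you raise `top_k` or your index can contain stale entries, a slightly more defensive lookup may be worth having. The sketch below is one possible variant, not part of the original notebook: `retrieve_original_context_safe` is a hypothetical name, it takes the chunk texts as a plain list of strings, and it assumes each match can be indexed like a dictionary with a `'metadata'` key, as in the function above.

```python
def retrieve_original_context_safe(response: dict, chunks: list[str]) -> str:
    """Map search matches back to chunk text, skipping duplicates and bad IDs."""
    seen = set()
    parts = []
    for match in response["matches"]:
        try:
            idx = int(match["metadata"]["source_id"])
        except (KeyError, TypeError, ValueError):
            continue  # missing or malformed source_id
        if idx in seen or not 0 <= idx < len(chunks):
            continue  # already included, or points outside the chunk list
        seen.add(idx)
        parts.append(chunks[idx])
    return "\n\n".join(parts)


# Hand-built response used only to exercise the function offline.
fake_response = {
    "matches": [
        {"metadata": {"source_id": "1"}},
        {"metadata": {"source_id": "1"}},   # duplicate
        {"metadata": {"source_id": "99"}},  # out of range
        {"metadata": {}},                   # no source_id
        {"metadata": {"source_id": "0"}},
    ]
}
print(retrieve_original_context_safe(fake_response, ["alpha", "beta"]))
```

With the notebook's data you would pass `[doc.page_content for doc in large_chunks]` as `chunks`.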

Now we'll re-run our vector search and LLM prompt against our original query, `"What are the examples where GPT-4 gave nonsense answers because it lacks common sense?"`

In [27]:
query_str = "What are the examples where GPT-4 gave nonsense answers because it lacks common sense?"
query_emb = create_embeddings([query_str])[0]

response = pinecone_index.query(vector=query_emb,
                                namespace="summary-plus-large-context-namespace",
                                top_k=1,
                                include_metadata=True)

answer = run_llm_qa_prompt(context=retrieve_original_context(response), question=query_str)

>>> Prompt Start >>>>>>>>>>>>>>>>>>>>>>>>>>>
Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

Context:
-----------------
So suppose I left five clothes to dry out in the sun, and it took them five hours to dry completely. How long would it take to dry 30 clothes? GPT-4, the newest, greatest AI system says 30 hours. Not good. A different one. I have 12-liter jug and six-liter jug, and I want to measure six liters. How do I do it? Just use the six liter jug, right? GPT-4 spits out some very elaborate nonsense.

Step one, fill the six-liter jug, step two, pour the water from six to 12-liter jug, step three, fill the six-liter jug again, step four, very carefully, pour the water from six to 12-liter jug. And finally you have six liters of water in the six-liter jug that should be empty by now.

OK, one more. Would I get a flat tire by bicycling over a bridge that is suspended 

Not only is that the right answer, it's well reasoned! 🥳🎉

#### A detailed follow-up
Let's see how it does on our more detailed follow-up question: `"Explain the example where GPT-4 failed to reason about drying clothes."`

In [28]:
follow_up_query_str = "Explain the example where GPT-4 failed to reason about drying clothes."
follow_up_query_emb = create_embeddings([follow_up_query_str])[0]

response = pinecone_index.query(vector=follow_up_query_emb,
                                namespace="summary-plus-large-context-namespace",
                                top_k=1,
                                include_metadata=True)

answer = run_llm_qa_prompt(context=retrieve_original_context(response), question=follow_up_query_str)

>>> Prompt Start >>>>>>>>>>>>>>>>>>>>>>>>>>>
Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

Context:
-----------------
So suppose I left five clothes to dry out in the sun, and it took them five hours to dry completely. How long would it take to dry 30 clothes? GPT-4, the newest, greatest AI system says 30 hours. Not good. A different one. I have 12-liter jug and six-liter jug, and I want to measure six liters. How do I do it? Just use the six liter jug, right? GPT-4 spits out some very elaborate nonsense.

Step one, fill the six-liter jug, step two, pour the water from six to 12-liter jug, step three, fill the six-liter jug again, step four, very carefully, pour the water from six to 12-liter jug. And finally you have six liters of water in the six-liter jug that should be empty by now.

OK, one more. Would I get a flat tire by bicycling over a bridge that is suspended 

Now the LLM is able to explain the example to the same level of depth as the original text, because it's looking at the original text.

So not only are we getting better quality output from our app, we're also requiring significantly less vector storage to do it.

### A note about cost

As with any database, the more you store in a vector database the more it's going to cost. So if we can _reduce_ the quantity of data we're putting in our vector database while _increasing_ the quality of our app's responses, that sounds like a double-win. In our case we went from 37 vectors to 12, roughly a 3x reduction, which adds up to a significant cost savings over time.

One thing to keep in mind with this summary strategy is that our costs at indexing time will increase due to the use of an LLM to create the summaries. Although this is likely a one-time cost if your dataset is static, it's still important to consider how it will impact the operation of your app.
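
To put rough numbers on the vector-count reduction, here is a back-of-the-envelope sketch. It only counts raw vector values and ignores metadata, index overhead, and any provider-specific pricing. The 1536-dimension figure and 4 bytes per value are assumptions; substitute the dimension of the embedding model you actually use.

```python
def vector_storage_bytes(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Raw size of the vector values alone (no metadata, no index overhead)."""
    return num_vectors * dimensions * bytes_per_value


DIMENSIONS = 1536  # assumption: replace with your embedding model's dimension

basic_rag_bytes = vector_storage_bytes(37, DIMENSIONS)    # 37 small chunks
summary_rag_bytes = vector_storage_bytes(12, DIMENSIONS)  # 12 summaries
reduction = basic_rag_bytes / summary_rag_bytes

print(f"Basic RAG:   {basic_rag_bytes:,} bytes")
print(f"Summary RAG: {summary_rag_bytes:,} bytes")
print(f"Reduction:   {reduction:.2f}x")
```

The ratio is independent of the dimension you plug in, since both strategies use the same embedding model; only the absolute byte counts change.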

### Putting your app in production

There are a lot of factors to consider when you're putting a new app in production. Your choice of vector storage, indexing, and query strategies are just one piece of the puzzle. You also need to consider how your source data will change over time, and the data pipeline needed to keep it up to date.

You also need to consider your anticipated app usage and the performance level it requires, and balance these requirements against the cost of building and operating your app at that scale, including managing your LLM cost.

Ninetack can help you work through all these decisions, and help you get your app in production.

## We'd love to talk with you

Ninetack is dedicated to helping our clients leverage the latest technologies to build innovative solutions for every industry.

We'd love to talk with you about how you're planning to incorporate vector search in your next AI application. Connect with us today @ ninetack.io!

### Cleaning up

Selectively run these as needed to clean up Pinecone. You can clean up a specific namespace to start that section over, or you can remove the index completely when you're done.

In [29]:
pinecone_index.delete(delete_all=True, namespace="basic-rag-namespace")

{}

In [30]:
pinecone_index.delete(delete_all=True, namespace="summary-rag-namespace")

{}

In [31]:
pinecone_index.delete(delete_all=True, namespace="summary-plus-large-context-namespace")

{}
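
If you'd rather clear all three namespaces in one go, the cells above can be folded into a small loop. This is a sketch: `clear_namespaces` is a hypothetical helper, and `FakeIndex` is a stand-in that only mimics the `delete(delete_all=True, namespace=...)` call used above so the example runs without a Pinecone connection.

```python
NAMESPACES = [
    "basic-rag-namespace",
    "summary-rag-namespace",
    "summary-plus-large-context-namespace",
]


def clear_namespaces(index, namespaces: list[str]) -> list[str]:
    """Delete all vectors in each namespace and return the names that were cleared."""
    cleared = []
    for namespace in namespaces:
        index.delete(delete_all=True, namespace=namespace)
        cleared.append(namespace)
    return cleared


class FakeIndex:
    """Hypothetical stand-in for `pinecone_index`; records calls instead of deleting."""

    def __init__(self):
        self.deleted = []

    def delete(self, delete_all=False, namespace=""):
        self.deleted.append((delete_all, namespace))
        return {}


fake_index = FakeIndex()
print(clear_namespaces(fake_index, NAMESPACES))
```

In the notebook you would call `clear_namespaces(pinecone_index, NAMESPACES)` instead of using the fake.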

Remove the index when you're done.

In [32]:
pinecone.delete_index("ted-talk-index")