# Retrieval-Augmented Generation with hybrid search and Claude 

This notebook demonstrates how to implement retrieval-augmented generation (RAG), connecting Anthropic's Claude models with the data in your Pinecone vector database. We will cover the following steps:

1. Setup: Setup and set Pinecone and Anthropic API keys
2. Ingestion: Embedding and upserting data into Pinecone using integrated inference
3. Retrieval: Querying a dense and a sparse index from Pinecone to retrieve results using hybrid search
4. Augmentation: Prepare the prompt
5. Generation: Using Claude to answer questions with information from the database

This notebook accompanies this [Retrieval-Augmented Generation article](https://www.pinecone.io/learn/retrieval-augmented-generation/).

## 1. Setup
First, let's install the necessary libraries and set the API keys we will need to use in this notebook.

In [48]:
%pip install -qU \
     anthropic==0.54.0 \
     pinecone==7.0.2 \
     pinecone-notebooks==0.1.1 \
     datasets==3.6.0 \
     backoff==2.2.1

Note: you may need to restart the kernel to use updated packages.


### Get and set the Pinecone API key

We will need a free [Pinecone API key](https://docs.pinecone.io/guides/get-started/quickstart). The code below will either authenticate you and set the API key as an environment variable or will prompt you to enter the API key and then set it in the environment.

In [49]:
import os
from getpass import getpass

def get_pinecone_api_key():
    """
    Get Pinecone API key from environment variable or prompt user for input.
    Returns the API key as a string.

    Only necessary for notebooks. When using Pinecone yourself, 
    you can use environment variables or the like to set your API key.
    """
    api_key = os.environ.get("PINECONE_API_KEY")
    
    if api_key is None:
        try:
            # Try Colab authentication if available
            from pinecone_notebooks.colab import Authenticate
            Authenticate()
            # If successful, key will now be in environment
            api_key = os.environ.get("PINECONE_API_KEY")
        except ImportError:
            # If not in Colab or authentication fails, prompt user for API key
            print("Pinecone API key not found in environment.")
            api_key = getpass("Please enter your Pinecone API key: ")
            # Save to environment for future use in session
            os.environ["PINECONE_API_KEY"] = api_key
            print("Pinecone API key saved to environment.")
            
    return api_key

PINECONE_API_KEY = get_pinecone_api_key()

### Set the Anthropic API key

Next, we'll need to get a [Claude API key](https://docs.anthropic.com/en/api/overview). The code below will prompt you to enter it and then set it in the environment.

In [50]:
def get_anthropic_api_key():
    """
    Get Anthropic API key from environment variable or prompt user for input.
    Returns the API key as a string.
    """

    api_key = os.environ.get("ANTHROPIC_API_KEY")
    
    if api_key is None:
        try:
            api_key = getpass("Please enter your Anthropic API key: ")
            # Save to environment for future use in session
            os.environ["ANTHROPIC_API_KEY"] = api_key
        except Exception as e:
            print(f"Error getting Anthropic API key: {e}")
            return None
    
    return api_key

ANTHROPIC_API_KEY = get_anthropic_api_key()

## 2. Ingestion

### Load the dataset

In this example, we'll show you how to build a simple RAG workflow over a set of Arxiv paper abstracts.

In [53]:
from datasets import load_dataset

dataset = load_dataset("jamescalam/ai-arxiv2", split="train")

#Let's take a peek at the data
dataset[0]

{'id': '2401.04088',
 'title': 'Mixtral of Experts',
 'summary': 'We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model.\nMixtral has the same architecture as Mistral 7B, with the difference that each\nlayer is composed of 8 feedforward blocks (i.e. experts). For every token, at\neach layer, a router network selects two experts to process the current state\nand combine their outputs. Even though each token only sees two experts, the\nselected experts can be different at each timestep. As a result, each token has\naccess to 47B parameters, but only uses 13B active parameters during inference.\nMixtral was trained with a context size of 32k tokens and it outperforms or\nmatches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular,\nMixtral vastly outperforms Llama 2 70B on mathematics, code generation, and\nmultilingual benchmarks. We also provide a model fine-tuned to follow\ninstructions, Mixtral 8x7B - Instruct, that surpasses GPT-3.5 Turbo,\nC

### Preprocess the dataset

We'll need to reorient our dataset to the records format, to support using Pinecone Integrated Inference. We only need some of the columns and we want the summary column renamed to chunk_text.

In [54]:
dataset = dataset.remove_columns(["content", "comment", "journal_ref", "references"])
dataset = dataset.rename_column("summary", "chunk_text")

dataset=dataset.to_list()
dataset[0]



{'id': '2401.04088',
 'title': 'Mixtral of Experts',
 'chunk_text': 'We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model.\nMixtral has the same architecture as Mistral 7B, with the difference that each\nlayer is composed of 8 feedforward blocks (i.e. experts). For every token, at\neach layer, a router network selects two experts to process the current state\nand combine their outputs. Even though each token only sees two experts, the\nselected experts can be different at each timestep. As a result, each token has\naccess to 47B parameters, but only uses 13B active parameters during inference.\nMixtral was trained with a context size of 32k tokens and it outperforms or\nmatches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular,\nMixtral vastly outperforms Llama 2 70B on mathematics, code generation, and\nmultilingual benchmarks. We also provide a model fine-tuned to follow\ninstructions, Mixtral 8x7B - Instruct, that surpasses GPT-3.5 Turbo,

### Pinecone vector database and integrated inference

We'll use hybrid search to implement retrieval. We'll create separate dense and sparse indexes, upsert dense vectors into the dense index and sparse vectors into the sparse index, and search each index separately. Then we'll combine and deduplicate the results, use one of Pinecone’s hosted reranking models to rerank them based on a unified relevance score, and then return the most relevant matches.

We'll use integrated inference, so when creating the indexes, we'll specify a Pinecone-hosted model to use for embedding queries and documents. Pinecone handles the embedding for us,
so we can pass it text directly. Learn more about hybrid search [here](https://docs.pinecone.io/guides/search/hybrid-search) and integrated inference [here](https://docs.pinecone.io/guides/index-data/indexing-overview#integrated-embedding).

Now, we can initialize the dense index, using the `llama-text-embed-v2` embedding model.

In [55]:
from pinecone import Pinecone

pc = Pinecone(
    api_key=PINECONE_API_KEY,
    # You can remove this parameter for your own projects
    source_tag="pinecone_examples:learn:generation:traditional_rag:traditional_rag_with_claude_and_hybrid"
)

dense_index_name = "traditional-rag-with-claude-dense"
if not pc.has_index(dense_index_name):
    pc.create_index_for_model(
        name=dense_index_name,
        cloud="aws",
        region="us-east-1",
        # Chunk text will be the field we embed from our documents
        embed={
            "model":"llama-text-embed-v2",
            "field_map":{"text": "chunk_text"}
        }
    )

dense_index = pc.Index(dense_index_name)
dense_index.describe_index_stats()

{'dimension': 1024,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {},
 'total_vector_count': 0,
 'vector_type': 'dense'}

Next, we'll initialize the sparse index, using the `pinecone-sparse-english-v0` sparse embedding model.

In [56]:
sparse_index_name = "traditional-rag-with-claude-sparse"
if not pc.has_index(sparse_index_name):
    pc.create_index_for_model(
        name=sparse_index_name,
        cloud="aws",
        region="us-east-1",
        # Chunk text will be the field we embed from our documents
        embed={
            "model":"pinecone-sparse-english-v0",
            "field_map":{"text": "chunk_text"}
        }
    )

sparse_index = pc.Index(sparse_index_name)
sparse_index.describe_index_stats()


{'index_fullness': 0.0,
 'metric': 'dotproduct',
 'namespaces': {},
 'total_vector_count': 0,
 'vector_type': 'sparse'}

We should see that the two new Pinecone indexes both have a total_vector_count of 0, as we haven't added any vectors yet.

Now, whenever we upsert to or query an index, Pinecone will automatically embed the text data with the specified embedding model.

To learn more about the strengths of our sparse model, reference [this article here](https://www.pinecone.io/learn/learn-pinecone-sparse/). To learn generally how dense models are trained and why, check out our guide [here](https://www.pinecone.io/learn/the-practitioners-guide-to-e5/).


### Embedding and upserting data to Pinecone 

With our indexes set up, we can now take our data and upsert them to each index. Here is a useful upsertion script that handles rate limits nicely. We [upsert in batches of 96](https://docs.pinecone.io/guides/index-data/upsert-data#upsert-in-batches), because this is the batch size limit for hosted models on Pinecone. This should take about a minute to upsert all of our data:

In [57]:
from tqdm import tqdm
import backoff


batch_size = 96

@backoff.on_exception(backoff.expo, Exception, max_tries=8, max_time=80, on_backoff=lambda details: print(f"Backoff: {details['tries']} of 8"))
def upsert_in_batches(dataset, index, batch_size):
    for start in tqdm(range(0, len(dataset), batch_size), desc="Upserting records batch"):
        batch = dataset[start:start + batch_size]
        index.upsert_records(namespace="arxiv", records=batch)

upsert_in_batches(dataset, dense_index, batch_size)
upsert_in_batches(dataset, sparse_index, batch_size)

Upserting records batch: 100%|██████████| 28/28 [00:23<00:00,  1.20it/s]
Upserting records batch: 100%|██████████| 28/28 [00:18<00:00,  1.49it/s]


In [58]:
print("dense index stats:")
dense_index.describe_index_stats()

dense index stats:


{'dimension': 1024,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {'arxiv': {'vector_count': 2673}},
 'total_vector_count': 2673,
 'vector_type': 'dense'}

In [59]:
print("sparse index stats:")
sparse_index.describe_index_stats()

sparse index stats:


{'index_fullness': 0.0,
 'metric': 'dotproduct',
 'namespaces': {'arxiv': {'vector_count': 2673}},
 'total_vector_count': 2673,
 'vector_type': 'sparse'}

## 3. Retrieval using hybrid search

### Perform semantic search

With our indexes populated, we can start making queries to get results.

We'll first query the dense index to find the 5 records most semantically related to the natural language query. Because the index is integrated with an embedding model, you provide the query as text and Pinecone converts the text to a dense vector automatically.

In [60]:
USER_QUESTION = "What are Mixture of Expert models, compared to regular transformer models?"

def search_index(index, question):
    # search_records embeds and queries the Pinecone index in one step
    results = index.search(
        namespace="arxiv", 
        query={
            # specifies number of results to return
            "top_k":5,
            # specifies the query to embed and search for
            "inputs":{
                "text": question
            }
        }
    )

    return results["result"]["hits"]

dense_results = search_index(dense_index, USER_QUESTION)

for num, result in enumerate(dense_results):
    # Return result and score
    print(f"Result {num+1}:")
    print(result["_id"])
    print(result["fields"]["chunk_text"])
    print(result["_score"])
    print("\n")

Result 1:
2112.14397
Mixture-of-experts (MoE) is becoming popular due to its success in improving
the model quality, especially in Transformers. By routing tokens with a sparse
gate to a few experts (i.e., a small pieces of the full model), MoE can easily
increase the model parameters to a very large scale while keeping the
computation cost in a constant level. Most existing works just initialize some
random experts, set a fixed gating strategy (e.g., Top-k), and train the model
from scratch in an ad-hoc way. We identify that these MoE models are suffering
from the immature experts and unstable sparse gate, which are harmful to the
convergence performance. In this paper, we propose an efficient end-to-end MoE
training framework called EvoMoE. EvoMoE starts from training one single expert
and gradually evolves into a large and sparse MoE structure. EvoMoE mainly
contains two phases: the expert-diversify phase to train the base expert for a
while and spawn multiple diverse experts from i

### Perform lexical search

Now we'll perform a lexical search by searching the sparse index for the 5 records that most exactly match the words in the query. This is an advanced variant of keyword search, and uses the `pinecone-sparse-english-v0` sparse embedding model that is optimized for keyword search.

In [61]:
sparse_results = search_index(sparse_index, USER_QUESTION)

for num, result in enumerate(dense_results):
    # Return result and score
    print(f"Result {num+1}:")
    print(result["_id"])
    print(result["fields"]["chunk_text"])
    print(result["_score"])
    print("\n")



Result 1:
2112.14397
Mixture-of-experts (MoE) is becoming popular due to its success in improving
the model quality, especially in Transformers. By routing tokens with a sparse
gate to a few experts (i.e., a small pieces of the full model), MoE can easily
increase the model parameters to a very large scale while keeping the
computation cost in a constant level. Most existing works just initialize some
random experts, set a fixed gating strategy (e.g., Top-k), and train the model
from scratch in an ad-hoc way. We identify that these MoE models are suffering
from the immature experts and unstable sparse gate, which are harmful to the
convergence performance. In this paper, we propose an efficient end-to-end MoE
training framework called EvoMoE. EvoMoE starts from training one single expert
and gradually evolves into a large and sparse MoE structure. EvoMoE mainly
contains two phases: the expert-diversify phase to train the base expert for a
while and spawn multiple diverse experts from i

### Merge and deduplicate the results

Next, we'll merge the dense and sparse results and deduplicated them based on the field we used to link sparse and dense vectors, in our case `_id`.

In [62]:
def merge_chunks(sparse_results, dense_results):
    """Get the unique hits from two search results and return them as single array of {'_id', 'chunk_text'} dicts, printing each dict on a new line."""
    # Deduplicate by _id
    deduped_hits = {hit['_id']: hit for hit in sparse_results + dense_results}.values()
    # Sort by _score descending
    sorted_hits = sorted(deduped_hits, key=lambda x: x['_score'], reverse=True)
    # Transform to format for reranking
    result = [{'_id': hit['_id'], 'chunk_text': hit['fields']['chunk_text']} for hit in sorted_hits]
    return result

merged_results = merge_chunks(sparse_results, dense_results)

print('[\n   ' + ',\n   '.join(str(obj) for obj in merged_results) + '\n]')

[
   {'_id': '2206.03382', 'chunk_text': 'Sparsely-gated mixture-of-experts (MoE) has been widely adopted to scale deep\nlearning models to trillion-plus parameters with fixed computational cost. The\nalgorithmic performance of MoE relies on its token routing mechanism that\nforwards each input token to the right sub-models or experts. While token\nrouting dynamically determines the amount of expert workload at runtime,\nexisting systems suffer inefficient computation due to their static execution,\nnamely static parallelism and pipelining, which does not adapt to the dynamic\nworkload. We present Flex, a highly scalable stack design and implementation\nfor MoE with dynamically adaptive parallelism and pipelining. Flex designs an\nidentical layout for distributing MoE model parameters and input data, which\ncan be leveraged by all possible parallelism or pipelining methods without any\nmathematical inequivalence or tensor migration overhead. This enables adaptive\nparallelism/pipelinin

### Rerank the results

We'll use one of Pinecone’s hosted reranking models, `bge-reranker-v2-m3` to rerank the merged and deduplicated results based on a unified relevance score.

In [63]:
def rerank_results(question, results):
    result = pc.inference.rerank(
        model="bge-reranker-v2-m3",
        query=question,
        documents=results,
        rank_fields=["chunk_text"],
        top_n=5,
        return_documents=True,
        parameters={
            "truncate": "END"
        }
    )
    return result.data

reranked_results = rerank_results(USER_QUESTION, merged_results)

print("Query", USER_QUESTION)
print('-----')
for row in reranked_results:
    print(row['document']['_id'])
    print(round(row['score'], 2))
    print(row['document']['chunk_text'])
    print("\n")

Query What are Mixture of Expert models, compared to regular transformer models?
-----
2112.14397
0.82
Mixture-of-experts (MoE) is becoming popular due to its success in improving
the model quality, especially in Transformers. By routing tokens with a sparse
gate to a few experts (i.e., a small pieces of the full model), MoE can easily
increase the model parameters to a very large scale while keeping the
computation cost in a constant level. Most existing works just initialize some
random experts, set a fixed gating strategy (e.g., Top-k), and train the model
from scratch in an ad-hoc way. We identify that these MoE models are suffering
from the immature experts and unstable sparse gate, which are harmful to the
convergence performance. In this paper, we propose an efficient end-to-end MoE
training framework called EvoMoE. EvoMoE starts from training one single expert
and gradually evolves into a large and sparse MoE structure. EvoMoE mainly
contains two phases: the expert-diversify ph

## 4. Augmentation

Next, we'll prepare the prompt with the search results as context for the next step, generation. Let's format them into a [search template using techniques](https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/overview) Claude has been trained with and add the formatted descriptions to a prompt. We'll use this prompt to send the search results as context to the generation step.

### Prompt creation

In [64]:
# Formatting search results
def format_results(extracted: list[str]) -> str:
        result = "\n".join(
            [
                f'<item index="{i+1}">\n<page_content>\n{r["document"]["chunk_text"]}\n</page_content>\n</item>'
                for i, r in enumerate(extracted)
            ]
        )
    
        return f"\n<search_results>\n{result}\n</search_results>"

def augment_prompt(results_list, question):
    return f"""\n\nHuman: {format_results(results_list)} Using the search results provided within the <search_results></search_results> tags, please answer the following question <question>{question}</question>. Do not reference the search results in your answer.\n\nAssistant:"""


In [65]:
print(augment_prompt(reranked_results, USER_QUESTION))



Human: 
<search_results>
<item index="1">
<page_content>
Mixture-of-experts (MoE) is becoming popular due to its success in improving
the model quality, especially in Transformers. By routing tokens with a sparse
gate to a few experts (i.e., a small pieces of the full model), MoE can easily
increase the model parameters to a very large scale while keeping the
computation cost in a constant level. Most existing works just initialize some
random experts, set a fixed gating strategy (e.g., Top-k), and train the model
from scratch in an ad-hoc way. We identify that these MoE models are suffering
from the immature experts and unstable sparse gate, which are harmful to the
convergence performance. In this paper, we propose an efficient end-to-end MoE
training framework called EvoMoE. EvoMoE starts from training one single expert
and gradually evolves into a large and sparse MoE structure. EvoMoE mainly
contains two phases: the expert-diversify phase to train the base expert for a
while and

## 5. Generation: Answering with Claude

Finally, let's ask the original user's question and get our answer from Claude.

In [66]:
import anthropic

client = anthropic.Anthropic()
model = "claude-3-5-haiku-latest"

def get_completion(prompt):
    message = client.messages.create(
        model=model,
        max_tokens=1000,
        temperature=1,
        # system="You are a keyword generating assistant. Given a user message, you'll generate keywords to search for products.",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": f"{prompt}"
                    }
                ]
            }
        ]
    )
    return message.content

In [67]:
answer = get_completion(augment_prompt(reranked_results, USER_QUESTION))

print(answer[0].text)


Based on the search results, Mixture of Experts (MoE) models differ from regular transformer models in several key ways:

1. Scalability: MoE models can significantly increase model parameters without proportionally increasing computational costs. They achieve this by routing tokens to only a few "experts" (specialized sub-networks) instead of processing the entire model for each token.

2. Dynamic Routing: Unlike traditional transformers with static architectures, MoE models dynamically route tokens to different experts based on the input's complexity. This allows the model to adapt more flexibly to varying input characteristics.

3. Computational Efficiency: MoE models keep computation costs constant while allowing for much larger model sizes. This means they can scale up the number of parameters without a linear increase in computational resources.

4. Sparsity: These models use sparse activation, where only a subset of experts (e.g., top-k experts) are activated for processing each

## Putting it all together

Let's create one final function to bring together the hybrid search, including searching the dense index, sparse index, merging and deduplicating the results and then reranking them.

In [68]:
def hybrid_search(question):
    # Hybrid search

    ## Search dense index
    dense_results = search_index(dense_index, question)

    ## Search sparse index
    sparse_results = search_index(sparse_index, question)

    ## Merge and deduplicate results
    merged_results = merge_chunks(sparse_results, dense_results)
    
    ## Rerank results
    reranked_results = rerank_results(question, merged_results)
    
    return reranked_results

And now we can conduct the final three steps of retrieval-augmented generation:

1. Retrieval using hybrid search
2. Augmenting the prompt with the search results
3. Sending the augmented prompt to the LLM to generate an answer

In [69]:
hybrid_results = hybrid_search(USER_QUESTION)

## Augment prompt with search results
prompt = augment_prompt(hybrid_results, USER_QUESTION)

## Answer the question
answer = get_completion(prompt)

print(answer[0].text)

Based on the search results, Mixture of Experts (MoE) models differ from regular transformer models in several key ways:

1. Scalability: MoE models can significantly increase the number of model parameters while keeping computational costs constant. This allows for larger models without proportionally increasing computational requirements.

2. Dynamic Routing: Unlike traditional transformer models with fixed processing, MoE models use a "sparse gate" mechanism that routes tokens to only a few specialized "experts" (sub-networks) based on the input. This means different parts of the input can be processed by different specialized neural network components.

3. Computational Efficiency: MoE models can adapt to the complexity of input instances dynamically, routing tokens to the most appropriate experts rather than processing everything through the entire network uniformly.

4. Flexibility: These models can be more adaptable, with the ability to develop diverse expert sub-networks that c

## Cleanup Indexes

Run these when you're done experimenting, to delete the indexes.

In [70]:
pc.delete_index(name=sparse_index_name)
pc.delete_index(name=dense_index_name)
