[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/docs/gen-qa-openai.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/docs/gen-qa-openai.ipynb)

# Question Answering using Retrieval Augmented Generation and Cascading Retrieval

## Intro

In this notebook we will learn how to retrieve contexts relevant to our queries from Pinecone and pass them to a generative OpenAI model to generate an answer backed by real data sources. Specifically, we'll answer questions about a subset of AI research papers from arXiv.

To do so, we'll implement [cascading retrieval](https://www.pinecone.io/blog/cascading-retrieval/), a search pattern using Pinecone that involves creating sparse and dense indexes, searching a single query across both, then reranking the results to find the most relevant set of information for our query.

Then, we'll hook up a prompt to an OpenAI LLM call to generate our answer. 

Let's begin by installing our requisite packages and getting our data ready:

In [1]:
!pip install -qU \
    openai==1.86.0 \
    pinecone==7.0.2 \
    datasets==3.6.0 \
    backoff==2.2.1 \
    pinecone-notebooks==0.1.1

---
### Demo Data: Arxiv Papers

In this example, we'll show you how to build a simple RAG workflow over a set of Arxiv paper abstracts.




In [34]:
## Getting our Dataset: 

from datasets import load_dataset

dataset = load_dataset("jamescalam/ai-arxiv2", split="train")

#Let's take a peek at the data
dataset[0]


{'id': '2401.04088',
 'title': 'Mixtral of Experts',
 'summary': 'We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model.\nMixtral has the same architecture as Mistral 7B, with the difference that each\nlayer is composed of 8 feedforward blocks (i.e. experts). For every token, at\neach layer, a router network selects two experts to process the current state\nand combine their outputs. Even though each token only sees two experts, the\nselected experts can be different at each timestep. As a result, each token has\naccess to 47B parameters, but only uses 13B active parameters during inference.\nMixtral was trained with a context size of 32k tokens and it outperforms or\nmatches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular,\nMixtral vastly outperforms Llama 2 70B on mathematics, code generation, and\nmultilingual benchmarks. We also provide a model fine-tuned to follow\ninstructions, Mixtral 8x7B - Instruct, that surpasses GPT-3.5 Turbo,\nC

In [3]:
print(dataset[0]["summary"])

We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model.
Mixtral has the same architecture as Mistral 7B, with the difference that each
layer is composed of 8 feedforward blocks (i.e. experts). For every token, at
each layer, a router network selects two experts to process the current state
and combine their outputs. Even though each token only sees two experts, the
selected experts can be different at each timestep. As a result, each token has
access to 47B parameters, but only uses 13B active parameters during inference.
Mixtral was trained with a context size of 32k tokens and it outperforms or
matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular,
Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and
multilingual benchmarks. We also provide a model fine-tuned to follow
instructions, Mixtral 8x7B - Instruct, that surpasses GPT-3.5 Turbo,
Claude-2.1, Gemini Pro, and Llama 2 70B - chat model on human benchmarks. Both


## Retrieval

### Setting up Access

Now that the data is ready, we can set up our indexes to store it.

We begin by instantiating a Pinecone client. To do this we need a [free API key](https://app.pinecone.io). We'll also set up our OpenAI API key at this time, so be sure to have one ready from your OpenAI account.

In [4]:

import os
from getpass import getpass

def get_pinecone_api_key():
    """
    Get Pinecone API key from environment variable or prompt user for input.
    Returns the API key as a string.

    Only necessary for notebooks. When using Pinecone yourself, 
    you can use environment variables or the like to set your API key.
    """
    api_key = os.environ.get("PINECONE_API_KEY")
    
    if api_key is None:
        try:
            # Try Colab authentication if available
            from pinecone_notebooks.colab import Authenticate
            Authenticate()
            # If successful, key will now be in environment
            api_key = os.environ.get("PINECONE_API_KEY")
        except ImportError:
            # pinecone_notebooks is not importable (not in Colab), so prompt the user for the API key
            print("Pinecone API key not found in environment.")
            api_key = getpass("Please enter your Pinecone API key: ")
            # Save to environment for future use in session
            os.environ["PINECONE_API_KEY"] = api_key
    
    return api_key

PINECONE_API_KEY = get_pinecone_api_key()


Pinecone API key not found in environment.


In [5]:
def get_openai_api_key():
    """
    Get OpenAI API key from environment variable or prompt user for input.
    Returns the API key as a string.
    """

    api_key = os.environ.get("OPENAI_API_KEY")
    
    if api_key is None:
        try:
            api_key = getpass("Please enter your OpenAI API key: ")
            # Save to environment for future use in session
            os.environ["OPENAI_API_KEY"] = api_key
        except Exception as e:
            print(f"Error getting OpenAI API key: {e}")
            return None
    
    return api_key

OPENAI_API_KEY = get_openai_api_key()

### Setting up our indexes

Now that we have API keys, we can set up our indexes:

In [6]:
from pinecone import Pinecone

# Configure client
pc = Pinecone(
        api_key=PINECONE_API_KEY,
        # You can remove this parameter for your own projects
        source_tag="pinecone_examples:docs:gen_qa_openai"
    )


Since we are using a dense and a sparse model, we'll need separate indexes to store embeddings created by each model. Fortunately, we can use Pinecone Integrated Inference to do this easily:

In [7]:

dense_index_name = 'gen-qa-openai-fast-dense'
sparse_index_name = 'gen-qa-openai-fast-sparse'


if not pc.has_index(dense_index_name):
    pc.create_index_for_model(
        name=dense_index_name,
        cloud="aws",
        region="us-east-1",
        embed={
            "model":"llama-text-embed-v2",
            "field_map":{"text": "chunk_text"}
        }
    )

if not pc.has_index(sparse_index_name):
    pc.create_index_for_model(
        name=sparse_index_name,
        cloud="aws",
        region="us-east-1",
        embed={
            "model":"pinecone-sparse-english-v0",
            "field_map":{"text": "chunk_text"}
        }
    )

Now, whenever we upsert to or query an index, Pinecone automatically calls the model configured for that index to embed our data.

To learn more about the strengths of our sparse model, [reference this article here](https://www.pinecone.io/learn/learn-pinecone-sparse/). To learn generally how dense models are trained and why, [check out our guide here](https://www.pinecone.io/learn/the-practitioners-guide-to-e5/).

## Preprocess our dataset

We'll need to reshape our dataset into the records format that Pinecone Integrated Inference expects.


In [35]:
dataset = dataset.remove_columns(["content", "comment", "journal_ref", "references"])
dataset = dataset.rename_column("summary", "chunk_text")


In [36]:
dataset=dataset.to_list()

In [37]:
dataset[0]

{'id': '2401.04088',
 'title': 'Mixtral of Experts',
 'chunk_text': 'We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model.\nMixtral has the same architecture as Mistral 7B, with the difference that each\nlayer is composed of 8 feedforward blocks (i.e. experts). For every token, at\neach layer, a router network selects two experts to process the current state\nand combine their outputs. Even though each token only sees two experts, the\nselected experts can be different at each timestep. As a result, each token has\naccess to 47B parameters, but only uses 13B active parameters during inference.\nMixtral was trained with a context size of 32k tokens and it outperforms or\nmatches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular,\nMixtral vastly outperforms Llama 2 70B on mathematics, code generation, and\nmultilingual benchmarks. We also provide a model fine-tuned to follow\ninstructions, Mixtral 8x7B - Instruct, that surpasses GPT-3.5 Turbo,
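As a rough illustration of what this reshaping amounts to, the sketch below does the same thing to a plain dictionary: it keeps `id` and `title` and moves the abstract into `chunk_text`, the field named in each index's `field_map`. The helper `to_record` and the sample row are hypothetical and not part of the notebook.

```python
def to_record(row):
    """Reshape one raw dataset row into a flat record (hypothetical helper)."""
    return {
        "id": row["id"],
        "title": row["title"],
        # The text to embed must live under the field named in field_map.
        "chunk_text": row["summary"],
    }


# Made-up row with the same kind of fields as the dataset entries shown earlier.
sample_row = {
    "id": "0000.00000",
    "title": "An Example Paper",
    "summary": "A short abstract.",
    "content": "Full text that we drop.",
}

record = to_record(sample_row)
print(record)
```

Any column not copied into the record, such as `content` here, is simply left out.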

### Embedding and Upserting our data

Below, we've written an upsertion script that retries with exponential backoff if a request fails, for example when we hit a rate limit. We upsert in batches of 96, because this is the batch size limit for hosted models on Pinecone. Upserting all of our data should take about a minute:

In [39]:
from tqdm import tqdm
import backoff


batch_size = 96
dense_index = pc.Index(dense_index_name)


@backoff.on_exception(backoff.expo, Exception, max_tries=8, max_time=80, on_backoff=lambda details: print(f"Backoff: {details['tries']} of 8"))
def upsert_in_batches(dataset, index, batch_size):
    for start in tqdm(range(0, len(dataset), batch_size), desc="Upserting records batch"):
        batch = dataset[start:start + batch_size]
        index.upsert_records(namespace="arxiv", records=batch)

upsert_in_batches(dataset, dense_index, batch_size)


Upserting records batch: 100%|██████████| 28/28 [00:26<00:00,  1.04it/s]


In [40]:
sparse_index = pc.Index(sparse_index_name)

upsert_in_batches(dataset, sparse_index, batch_size)

Upserting records batch: 100%|██████████| 28/28 [00:15<00:00,  1.82it/s]
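The progress bars show 28 batches because slicing a list in steps of 96 yields `ceil(len(dataset) / 96)` chunks, with a smaller final chunk when the length is not a multiple of 96. The sketch below reproduces that arithmetic on a stand-in list; `iter_batches` is a hypothetical helper, and the list length of 200 is made up rather than taken from the dataset.

```python
import math


def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items (hypothetical helper)."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Stand-in data: 200 fake records, not the real dataset size.
fake_records = [{"id": str(i)} for i in range(200)]
batches = list(iter_batches(fake_records, 96))

print(len(batches))                        # number of batches
print([len(b) for b in batches])           # size of each batch
print(math.ceil(len(fake_records) / 96))   # same count computed directly
```

Because the `backoff` decorator wraps the whole `upsert_in_batches` function, a retry starts again from the first batch. Upserting a record with an existing ID overwrites it, so a retry repeats work rather than duplicating data.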


Now that we've added all of our papers, their summaries, and their embeddings to the indexes, we can write the cascading retrieval logic.

### Cascading Retrieval

For many AI applications, dense or sparse search alone isn't enough to find relevant results. Cascading retrieval allows us to leverage the strengths of dense models and sparse models in order to further refine search results given a user query.

To implement cascading retrieval, we will need to conduct the following steps:

1. Query: We query our sparse and dense indexes with a large top_k value
2. Deduplicate: We combine and deduplicate results from the previous queries to achieve a preliminary result set
3. Rerank: We pass this result set, and the query, to a Rerank endpoint hosted on Pinecone, returning our final set of documents

The final set of documents can be considered the most relevant results for the user's query across both dense and sparse search.

To learn more about exactly how reranking models can help improve relevance of search results, [check out our article here](https://www.pinecone.io/learn/refine-with-rerank/).
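To make the three steps concrete without calling any service, here is a minimal offline sketch. The hit lists are made up, and `toy_score` is a simple word-overlap stand-in for a reranking model, not the reranker hosted on Pinecone; all names in this block are hypothetical.

```python
def merge_and_dedupe(sparse_hits, dense_hits):
    """Combine two hit lists, keeping one entry per _id."""
    merged = {}
    for hit in sparse_hits + dense_hits:
        merged[hit["_id"]] = hit  # later duplicates overwrite earlier ones
    return list(merged.values())


def toy_score(query, text):
    """Stand-in relevance score: number of lowercase words shared with the query."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def toy_rerank(query, hits, top_n):
    """Sort hits by the stand-in score, highest first, and keep top_n."""
    ranked = sorted(hits, key=lambda h: toy_score(query, h["text"]), reverse=True)
    return ranked[:top_n]


# Made-up results from a sparse and a dense search; "b" appears in both.
sparse_hits = [
    {"_id": "a", "text": "sparse gate routing in mixture of experts"},
    {"_id": "b", "text": "a survey of transformer language models"},
]
dense_hits = [
    {"_id": "b", "text": "a survey of transformer language models"},
    {"_id": "c", "text": "image classification with convolutional networks"},
]

query = "mixture of experts routing"
candidates = merge_and_dedupe(sparse_hits, dense_hits)  # steps 1 and 2
final = toy_rerank(query, candidates, top_n=2)          # step 3
print([hit["_id"] for hit in final])
```

The point of the pattern is that the first-stage searches only need to be broad: the reranker sees every candidate next to the query and decides the final order.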

In [57]:

def query_pinecone(query, index, top_k=20):
    # queries Pinecone given an index. 
    # The search endpoint embeds and queries the index with the model specified in index creation.
    # Returns the top_k results ordered by relevance to the query

    results = index.search(
        namespace="arxiv", 
        query={
            "inputs": {"text": query}, 
            "top_k": top_k}
        )
    
    return results

def dedupe(sparse_results, dense_results):
    # Deduplicates results from sparse and dense indexes


    # Deduplicate by _id
    deduped_hits = {hit['_id']: hit for hit in sparse_results['result']['hits'] + dense_results['result']['hits']}.values()
    # Transform to format for reranking
    result = [{'id': hit['_id'], "title": hit['fields']['title'], 'summary': hit['fields']['chunk_text']} for hit in deduped_hits]
    return result


def rerank(query, results, top_n=5):
    # Calls a reranking model hosted on Pinecone, returning the top_n results ordered by relevance to the query
    result = pc.inference.rerank(
        # free-tier reranker
        model="bge-reranker-v2-m3",
        query=query,
        documents=results,
        rank_fields=["summary"],
        top_n=top_n,
        return_documents=True,
        parameters={
            "truncate": "END"
        }
    )
    # rerank results
    return result

def cascading_retrieval(query, sparse_index, dense_index,top_k=10, top_n=5):
    # Conducts a search over sparse and dense indexes, followed by deduplication and reranking
    # Returns a list of top_n results ordered by relevance to the query

    # query dense
    dense_results = query_pinecone(query, dense_index, top_k)
    # query sparse
    sparse_results = query_pinecone(query, sparse_index, top_k)
    # dedupe results
    deduped_results = dedupe(sparse_results, dense_results)
    # rerank results
    reranked_results = rerank(query, deduped_results, top_n)
    reranked_results = reranked_results.data

    return reranked_results


We can take a look at the set of returned results here:

In [58]:
query = "What are Mixture of Expert models, compared to regular transformer models?"

results = cascading_retrieval(query, sparse_index, dense_index, top_k=10, top_n=5)


for result in results:
    print(result["document"]["id"])
    print(result["document"]["title"])
    print("\n")    
    print(result["document"]["summary"])
    print("\n")

2112.14397
EvoMoE: An Evolutional Mixture-of-Experts Training Framework via Dense-To-Sparse Gate


Mixture-of-experts (MoE) is becoming popular due to its success in improving
the model quality, especially in Transformers. By routing tokens with a sparse
gate to a few experts (i.e., a small pieces of the full model), MoE can easily
increase the model parameters to a very large scale while keeping the
computation cost in a constant level. Most existing works just initialize some
random experts, set a fixed gating strategy (e.g., Top-k), and train the model
from scratch in an ad-hoc way. We identify that these MoE models are suffering
from the immature experts and unstable sparse gate, which are harmful to the
convergence performance. In this paper, we propose an efficient end-to-end MoE
training framework called EvoMoE. EvoMoE starts from training one single expert
and gradually evolves into a large and sparse MoE structure. EvoMoE mainly
contains two phases: the expert-diversify phase 

## Augmentation

### Prompt Creation
Next, we write some functions to retrieve these relevant contexts from Pinecone and incorporate them into a richer chat completion prompt. This is the "augmentation" part of retrieval augmented generation.


In [60]:
def retrieval_augmented_prompt(query, sparse_index, dense_index, top_k=10, top_n=5):

    # Get relevant contexts
    query_results = cascading_retrieval(query, sparse_index, dense_index, top_k, top_n)

    # Build our prompt with the retrieved contexts included
    prompt_start = (
        "Answer the question based on the retrieved papers below.\n\n"+
        "Retrieved Papers:\n"
    )
    prompt_end = (
        f"\n\nQuestion: {query}\nAnswer:"
    )
    context_separator = "\n\n---\n\n"

    # Format each retrieved paper as a context block
    combined_contexts = []
    
    for context in query_results:
        formatted_paper = f"Result Title: {context['document']['title']}\n Result Summary: {context['document']['summary']}"
        combined_contexts.append(formatted_paper)
    
    return prompt_start + context_separator.join(combined_contexts) + prompt_end

In [62]:
prompt_with_context = retrieval_augmented_prompt(query, sparse_index, dense_index, top_k=10, top_n=5)
print(prompt_with_context)

Answer the question based on the retrieved papers below.

Retrieved Papers:
Result Title: EvoMoE: An Evolutional Mixture-of-Experts Training Framework via Dense-To-Sparse Gate
 Result Summary: Mixture-of-experts (MoE) is becoming popular due to its success in improving
the model quality, especially in Transformers. By routing tokens with a sparse
gate to a few experts (i.e., a small pieces of the full model), MoE can easily
increase the model parameters to a very large scale while keeping the
computation cost in a constant level. Most existing works just initialize some
random experts, set a fixed gating strategy (e.g., Top-k), and train the model
from scratch in an ad-hoc way. We identify that these MoE models are suffering
from the immature experts and unstable sparse gate, which are harmful to the
convergence performance. In this paper, we propose an efficient end-to-end MoE
training framework called EvoMoE. EvoMoE starts from training one single expert
and gradually evolves into a 
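`retrieval_augmented_prompt` includes every reranked abstract in full. If prompt size becomes a concern, one option is to stop adding contexts once a budget is reached. The sketch below assumes a character budget (`max_chars`, a hypothetical parameter) and a hypothetical helper name; a real implementation would more likely count tokens with the model's tokenizer.

```python
def join_within_budget(contexts, separator="\n\n---\n\n", max_chars=3000):
    """Join contexts in order, stopping before the total would exceed max_chars."""
    kept = []
    total = 0
    for context in contexts:
        # Account for the separator that precedes every context but the first.
        extra = len(context) + (len(separator) if kept else 0)
        if total + extra > max_chars:
            break
        kept.append(context)
        total += extra
    return separator.join(kept)


# Made-up contexts, already ordered by relevance.
contexts = ["A" * 40, "B" * 40, "C" * 40]
joined = join_within_budget(contexts, separator="|", max_chars=90)
print(len(joined))
```

Since the contexts arrive ordered by relevance, cutting from the end drops the least relevant papers first.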

## Generation: Calling OpenAI to create a response

Now that we are building a rich prompt with context from our index, we are ready to get chat completions from OpenAI.

In [63]:
from openai import OpenAI

def chat_completion(prompt):

    # Instantiate the OpenAI client
    client = OpenAI(api_key=OPENAI_API_KEY)
    
    # Instructions
    sys_prompt = "You are a helpful assistant that always answers questions."
    res = client.chat.completions.create(
        model='gpt-4o-mini',
        messages=[
            {"role": "system", "content": sys_prompt},
            {"role": "user", "content": prompt}
        ],
        temperature=0
    )
    return res.choices[0].message.content.strip()

### Final workflow

To implement RAG, we simply create our prompt using cascading retrieval results, and pass it to an OpenAI LLM call!

In [64]:
def rag(query):
    # Retrieve relevant results, format them into a prompt, and pass it to OpenAI
    
    prompt = retrieval_augmented_prompt(query, sparse_index, dense_index, top_k=20, top_n=5)
    return chat_completion(prompt)

## Results

In [65]:
query = "What are Mixture of Expert models, compared to regular transformer models?"


answer = rag(query)
print(answer)

Mixture of Experts (MoE) models are a specialized architecture within the realm of deep learning, particularly designed to enhance the efficiency and scalability of neural networks, especially in comparison to regular transformer models. Here are the key differences:

1. **Parameter Utilization**: MoE models utilize a sparse activation mechanism, where only a subset of the model's parameters (experts) is activated for each input. This allows MoE models to have a significantly larger number of parameters while maintaining a constant computational cost per input. In contrast, regular transformer models typically use the same parameters for all inputs, leading to a direct correlation between the number of parameters and the computational cost.

2. **Dynamic Routing**: In MoE models, tokens are routed to different experts based on a gating mechanism, which can adaptively select which experts to activate for a given input. This dynamic routing allows MoE models to handle varying complexitie

And we get a pretty great answer straight away, grounded in the retrieved papers: it explains how MoE models differ from regular transformer models in parameter utilization and dynamic routing.

## Demo cleanup

Once we're done with the indexes, we can delete them to save resources:

In [None]:
pc.delete_index(name=dense_index_name)
pc.delete_index(name=sparse_index_name)

---