In [None]:
# Copyright 2024 NVIDIA Corporation. All Rights Reserved.

# Each user is responsible for checking the content of datasets and the
# applicable licenses and determining if suitable for the intended use.

<img src="http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png" style="width:60px; float:right"><br>
# <font color="#76b900">**Build a RAG Pipeline**<br/>with NVIDIA AI Foundation Models & Endpoints</font>

**Welcome To Your Cloud Environment!** This interactive web application, which you're currently using to run Python code, is more than just a simple interface. When you access this Jupyter Notebook, an instance on a cloud platform is allocated to you by the [**NVIDIA Deep Learning Institute (DLI)**](https://www.nvidia.com/en-us/training/). This forms your base cloud environment, essentially a blank canvas for further setup, and includes:

- A dedicated CPU, and possibly a GPU, for processing.
- A pre-installed base operating system.
- A pre-installation of packages necessary to run the lab.

### Learning Objectives 

In this tutorial, we will build a **Chat-with-your-Documents RAG pipeline** using [**NVIDIA AI Foundation Models**](https://build.nvidia.com/explore/discover), accessed through endpoints hosted in NGC. Developers and data scientists can use these endpoints to quickly build proof-of-concept (PoC) applications on hosted model instances, much as they would with OpenAI's hosted APIs.

On the other hand, NVIDIA also provides **a suite of microservices (NIMs)** that let you move the exact same models into self-managed hosting, on your on-prem infrastructure or DGX Cloud, by changing a few lines of code. NVIDIA microservices can scale out based on load and run the vector DB, embedding, and LLM inference entirely on GPUs.

Here, we will focus on building a RAG pipeline using NVIDIA AI Foundation Models and will learn:

-  The components of a RAG pipeline
    - Document loading
    - Document preprocessing
    - Generating embeddings for the chunked documents
    - Indexing with Vector Database
    - LLM (generator)
- How to use *NVIDIA AI Foundation Endpoints* for easy access to hosted endpoints for generative AI models like Mistral, Llama-2, SteerLM, etc.
- How to send a query and get a response from an LLM
- Why evaluation is a critical aspect of building and deploying RAG pipelines

### Setting Your API Key

In order to successfully run this notebook, you will need an **NVIDIA API key**. If you have not created your API key yet, please go through the setup steps in the [**Introduction**](./Introduction.ipynb) notebook and generate an API key with your own user account. You can supply the `NVIDIA_API_KEY` directly in this notebook when you run the cell below:

In [None]:
import getpass
import os

## API Key can be found by going to build.nvidia.com -> (Hosted Model of Choice) -> Get API Code or similar.

# del os.environ['NVIDIA_API_KEY']  ## delete key and reset
if os.environ.get("NVIDIA_API_KEY", "").startswith("nvapi-"):
    print("Valid NVIDIA_API_KEY already in environment. Delete to reset")
else:
    nvapi_key = getpass.getpass("NVAPI Key (starts with nvapi-): ")
    assert nvapi_key.startswith("nvapi-"), f"{nvapi_key[:5]}... is not a valid key"
    os.environ["NVIDIA_API_KEY"] = nvapi_key

<br/>

## <font color="#76b900">**1. NVIDIA AI Foundation Endpoints**</font>


[**NVIDIA AI Foundation Endpoints**](https://www.nvidia.com/en-us/ai-data-science/foundation-models/) give users easy access to NVIDIA hosted API endpoints for powerful models like Mixtral 8x7B, Llama 2, Stable Diffusion, etc. These models are optimized, tested, and hosted on the NVIDIA AI platform, making them fast and easy to evaluate, further customize, and seamlessly run at peak performance on any accelerated stack.

- **For Designing:** With NVIDIA AI Foundation Endpoints, you can get quick results from a fully accelerated stack running on NVIDIA DGX Cloud. NVIDIA AI Foundation Models are freely available to experiment with now on the NVIDIA [**NGC catalog**](https://catalog.ngc.nvidia.com/ai-foundation-models) and Hugging Face.
- **For Deploying:** Once you're ready to move to production, these models can be deployed anywhere with enterprise-grade security, stability, and support using [**NVIDIA AI Enterprise**](https://www.nvidia.com/en-us/data-center/products/ai-enterprise/) or manually deployed on self-managed infrastructure using tools like [**TensorRT-LLM**](https://github.com/NVIDIA/TensorRT-LLM) and [**NIM**](https://developer.nvidia.com/nemo-microservices-early-access).

Using NeMo Retriever, enterprises can connect their LLMs to multiple data sources and knowledge bases, so that users can easily interact with data and receive accurate, up-to-date answers using simple, conversational prompts. 


### Getting Started With LangChain

We asked the question `What is Langchain?` to one of the strongest open-source LLMs out there: `mixtral_8x7b`. It generated the answer below:

```
In summary, LangChain is a decentralized platform that leverages blockchain technology to provide a secure and transparent marketplace for language services, while Langchain is a language learning platform that utilizes AI technology to provide personalized language learning content.
```

This does not look like the answer we are looking for, right? :) 

From its developers, [LangChain](https://github.com/langchain-ai/langchain) is defined as "a framework for developing applications powered by language models." Available in Python and TypeScript (JS), the LangChain framework contains:
- Interfaces and integrations for a variety of LLM building blocks (chain components).
- A basic runtime scheme for combining these components into complex chains and agents.
- Easy-to-use off-the-shelf implementations of useful LLM pipelines.

To help interface with this framework, the [**langchain-nvidia-ai-endpoints package**](https://github.com/langchain-ai/langchain-nvidia) provides connectors like [**`ChatNVIDIA`** ](https://github.com/langchain-ai/langchain-nvidia/blob/main/libs/ai-endpoints/langchain_nvidia_ai_endpoints/chat_models.py) and [**`NVIDIAEmbeddings`** ](https://github.com/langchain-ai/langchain-nvidia/blob/main/libs/ai-endpoints/langchain_nvidia_ai_endpoints/embeddings.py) to help interface with the raw endpoints. These will be used throughout the course to power our RAG pipeline!

In [None]:
## Import the libraries
import faiss
from operator import itemgetter
from langchain.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter, SentenceTransformersTokenTextSplitter
from langchain_nvidia_ai_endpoints import ChatNVIDIA
from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings
from langchain_community.document_loaders import TextLoader

In [None]:
ChatNVIDIA.get_available_models()

We initialize the LLM by instantiating the `ChatNVIDIA` class with our model of choice.

In [None]:
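## Read the key from the environment: `nvapi_key` is undefined when the key was already set before the setup cell ran
llm = ChatNVIDIA(model="ai-mistral-7b-instruct-v2", nvidia_api_key=os.environ["NVIDIA_API_KEY"], max_tokens=1024)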

From there, we can use LangChain chat model methods like `invoke` to generate the full response or `stream` to pull response tokens as they are generated.

In [None]:
result = llm.invoke("Write a three-line poem about NVIDIA NeMo.")
print(result.content)

# for token in llm.stream("Write a three-line poem about NVIDIA NeMo."):
#     print(token.content, end="")

Looking at our list of models, a few general examples stand out: 
- `steerlm_llama_70b`: This model was trained with the SteerLM approach developed by the NVIDIA NeMo Team, introduced as part of NVIDIA NeMo Alignment methods. It simplifies the customization of large language models (LLMs) and gives users dynamic control over model outputs by specifying desired attributes. You can read more about the model [here](https://arxiv.org/abs/2310.05344) and [here](https://docs.nvidia.com/nemo-framework/user-guide/latest/modelalignment/steerlm.html). You can try `NV-Llama2-70B-SteerLM-Chat` on the [**NVIDIA NGC Catalog**](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-foundation/models/llama2-70b-steerlm).
- `mixtral_8x7b-instruct`: When using this model, it is best to follow the recommended [instruct format](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1#instruction-format). Keep in mind that larger, more accurate multi-purpose models are slower at inference and more expensive to deploy. [NVIDIA TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. `TensorRT-LLM` consists of the TensorRT deep learning compiler and includes optimized kernels, pre- and post-processing steps, and multi-GPU/multi-node communication primitives for groundbreaking performance on NVIDIA GPUs.

For the purposes of this talk, we will mostly use a Mistral instruct model (`ai-mistral-7b-instruct-v2` in the code cells) since it works nicely by default and is good at following instructions. This makes it a great candidate for a language-reasoning backbone that can be guided by context!

### Guiding LLM Generation

In this talk, we will use our chosen LLM as a **"generator"** to give us some natural language responses. As these models have been trained to predict reasonable continuations and answer questions, they can be used as-is to predict average-human responses by default. 

Let's ask a question to our language model without providing the context.

In [None]:
result = llm.invoke("who played Oppenheimer in the  Oppenheimer movie?")
print(result.content)

This is not the answer we are looking for: the film is recent, so this information lies beyond the model's training data :)

To help it out, let's provide our model some **context** that it can use and ask the question again. 

In [None]:
prompt = """
Oppenheimer is a 2023 epic biographical drama film written for the screen and directed by Christopher Nolan. 
It stars Cillian Murphy as Oppenheimer, the American theoretical physicist credited with being the "father of the atomic bomb" 
for his role in the Manhattan Project—the World War II undertaking that developed the first nuclear weapons. 

who played Oppenheimer in the  Oppenheimer movie?
"""

In [None]:
result = llm.invoke(prompt)
print(result.content)

Yes, this time we got our answer! 

In this case, while the LLM is treated as a generator, the "context" is used to drive the generation in a desirable direction. This example just specifies it manually, but some other schemes could be used to pre-load this context or add surrounding instructions. 

### Constructing A Useful Prompt

Now, let's do the same with a prompt template. 
- A **prompt template** is a pre-defined format that locks in certain aspects of a model component's inputs (e.g. instructions, question, etc.).
- A **prompt** is the filled-in version of the template that is fed to the LLM to generate its output. 
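
As a rough, library-independent illustration of this distinction, a template can be modeled as a string with named placeholders and a prompt as the result of filling them in. The names `TEMPLATE` and `fill_template` are made up for this sketch and are not LangChain APIs.

```python
# Minimal sketch: a template is a string with placeholders,
# a prompt is the filled-in result that would be sent to the LLM.
# TEMPLATE and fill_template are hypothetical names used only for illustration.
TEMPLATE = (
    "Answer solely based on the following context: {context}\n"
    "Question: {question}"
)

def fill_template(template: str, **values: str) -> str:
    """Return the prompt produced by substituting values into the template."""
    return template.format(**values)

filled_prompt = fill_template(
    TEMPLATE,
    context="It stars Cillian Murphy as Oppenheimer.",
    question="Who played Oppenheimer?",
)
print(filled_prompt)
```

The same template can be reused with any context and question, which is what makes it convenient to place inside a chain.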

We use the [**ChatPromptTemplate**](https://api.python.langchain.com/en/latest/prompts/langchain_core.prompts.chat.ChatPromptTemplate.html#) class to create a prompt template for chat models.

In [None]:
context = """
Oppenheimer is a 2023 epic biographical drama film written for the screen and directed by Christopher Nolan. 
It stars Cillian Murphy as Oppenheimer, the American theoretical physicist credited with being the "father of the atomic bomb" 
for his role in the Manhattan Project—the World War II undertaking that developed the first nuclear weapons. 
"""

In [None]:
from langchain.schema.runnable import RunnableLambda

from functools import partial
from operator import itemgetter

prompt = ChatPromptTemplate.from_messages([
    (
        "system", 
        "Answer solely based on the following context. Provide concise answer: {context}"
    ),
    ("user", "{question}"),
])

model = ChatNVIDIA(model="ai-mistral-7b-instruct-v2", max_tokens=1024)

def print_return(d):
    print(repr(d))
    return d

chain = (
    {"context": lambda x: context, "question": RunnablePassthrough()}
    | prompt
    | print_return  ## <- Include to see what gets passed to the model
    | model
    | StrOutputParser()
)

When we call the chain on an input query, the `print_return` function also lets us see the entire input that goes into the model.

In [None]:
response = chain.invoke("who played Oppenheimer in the movie Oppenheimer?")

In [None]:
print(response)

**This time we got a concise answer!**

In [None]:
response = chain.invoke("who is the producer in the Oppenheimer movie?")

In [None]:
print(response)

**This is expected: the supplied context does not mention the producer, and the model was not trained on this information, so it has nothing to draw on.**

Let's explain this further in the following section.

<br/>

## <font color="#76b900">**2. Retrieval-Augmented Generation (RAG) for Q&A**</font>

What just happened? As you just experienced, LLMs do not respond reasonably to every question we ask. Although Large Language Models (LLMs) show promising results in understanding and generating text, they can suffer from issues such as hallucination, limited reasoning, and biased responses, to name a few.
These pitfalls might occur due to:
- **Domain knowledge deficit**
- **Outdated information**
- **Catastrophic forgetting**

You can read more about these factors and solutions [here](https://arxiv.org/pdf/2312.05934.pdf). **Retrieval Augmented Generation (RAG)** is a proposed solution to these issues, helping practitioners use current, domain-specific data to augment LLM capabilities. Many LLM applications need user-specific data that is not part of the model's training set; supplying it as an external source lets us chat with our domain-specific data (documents). In RAG, external data is retrieved and then passed to the LLM during the generation step.

RAG should be seen as a pipeline or a system where applications typically have multiple stages: 
- **Before Query:** Loading and embedding relevant documents/information
- **During Query:** Embed query, retrieve relevant information, and respond. 

A potential diagram is shown below:

<img src="./images/naiveRAG.png" width=600>

The elementary components of a RAG pipeline for a Q&A task are: 

- **Indexing**
    - Document loading & chunking
    - Document embedding, if a dense embedding model is used (explained in Section 2.1.3 below). A dense retriever is optional: if you want a retrieval method that doesn't need a neural network model for indexing, you can use sparse retrieval with algorithms like [BM25](https://en.wikipedia.org/wiki/Okapi_BM25), a variant of [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).
    - Embedding storing and indexing with a vector database (e.g. FAISS, Milvus, Chroma DB).
- **Retrieval**
    - Given a user query, relevant splits are retrieved from the vector database.
- **Augmentation & Generation**
  - Use the retrieved information to make a good prompt for a strong instruction-following LLM.
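
The stages above can be sketched end to end without any model. This is only a toy under stated assumptions: word counts stand in for an embedding model, a Python list stands in for the vector database, and the final LLM call is left out. All `toy_*` names are hypothetical and not part of LangChain or any NVIDIA package.

```python
import math
from collections import Counter

def toy_embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().replace("?", " ").replace(".", " ").split())

def toy_cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Indexing (before query): embed each chunk once and store it.
toy_chunks = [
    "Oppenheimer is a 2023 film directed by Christopher Nolan.",
    "It stars Cillian Murphy as Oppenheimer.",
    "GTC is NVIDIA's AI conference.",
]
toy_index = [(chunk, toy_embed(chunk)) for chunk in toy_chunks]

def toy_retrieve(question: str, k: int = 1) -> list:
    """Retrieval (during query): embed the question and return the k closest chunks."""
    q = toy_embed(question)
    ranked = sorted(toy_index, key=lambda item: toy_cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def toy_augment(question: str, k: int = 1) -> str:
    """Augmentation: build the prompt that would be sent to the generator LLM."""
    retrieved = "\n".join(toy_retrieve(question, k))
    return f"Answer solely based on the following context: {retrieved}\n\nQuestion: {question}"

print(toy_augment("Who stars as Oppenheimer?"))
```

In the real pipeline, each toy piece is replaced by the corresponding component introduced in the following sections.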

Now let's go over each component of the RAG pipeline and go through the process of building it up for our use case!

### 2.1. Indexing

In this section we will go over how to load a document, chunk it, and generate embeddings from the chunked documents (from our corpus).

#### 2.1.1. Load Documents

If we want to build a **Chat With Your Data** application and want LLMs to generate relevant and specific responses for our queries, we need the model to understand our domain and provide answers from our data instead of giving broad and generalized responses. 

To do so, we start by loading the knowledge base that we want to query. LangChain provides [**document loaders**](https://python.langchain.com/docs/modules/data_connection/document_loaders/) to help process our data into `Document` values containing the appropriate text and metadata. There are a variety of such document loaders that can help us load anything from [**PDFs**](https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf) to [**web pages**](https://python.langchain.com/docs/use_cases/web_scraping) to even [**video transcriptions**](https://python.langchain.com/docs/integrations/document_loaders/youtube_transcript), but for this section we will stick to a simple text file loader.

In [None]:
## Provide the source of the txt document
loader = TextLoader("/dli/task/dataset/gtc_sessions.txt")
docs = loader.load()

In [None]:
## This is a Document object.
type(docs[0])

In [None]:
type(docs[0].page_content)

In [None]:
## Uncomment the line below if you want to check the page content
#docs[0].page_content

In [None]:
docs[0].metadata

In [None]:
## Printing out a sample of the content
print("Number of Documents Retrieved:", len(docs))
print(f"Sample of Document 1 Content (Total Length: {len(docs[0].page_content)}):")
print(docs[0].page_content[:500])  ## show the first 500 characters as the sample

#### 2.1.2. Chunking

Once documents have been loaded, they are often transformed into more usable formats. One method of transformation is known as **chunking**, which breaks down large pieces of text (e.g. a long document) into smaller segments called chunks. This technique is valuable because it helps optimize the relevance of the content returned from the vector database and limits the prompt length to something the LLM can reason with. 

Remember that LLMs can struggle to reason with long inputs that push beyond their rated context lengths, and even those that are rated for long context (e.g. GPT-4) incur higher costs for the extra tokens and can exhibit instruction-following degradation. With chunking, you can take steps to alleviate this by retrieving and organizing only the most relevant information. There are different chunking strategies, such as fixed-size, sliding-window, and content-based chunking, that work for different scenarios, and chunk size and preprocessing can be tuned to optimize performance for your specific data.

LangChain provides a [variety of document transformers](https://python.langchain.com/docs/integrations/document_transformers/) out of which we will use the [**`RecursiveCharacterTextSplitter`** ](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter). This option will allow us to split our document with preference for some natural stopping points that we want our chunks to follow (as much as possible).

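Important parameters to know here are `chunk_size` and `chunk_overlap`. 
- `chunk_size` controls the maximum size of the final documents. It is measured in characters by default, and in tokens when the splitter is built with a token-based constructor such as `from_tiktoken_encoder()`. 
- `chunk_overlap` specifies how much overlap there should be between consecutive chunks. 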

Some overlap is often helpful to make sure that the text isn't split awkwardly. 
LangChain also allows you to [**split by tokens**](https://python.langchain.com/docs/modules/data_connection/document_transformers/split_by_token) using the [tiktoken](https://github.com/openai/tiktoken/tree/main)-backed `from_tiktoken_encoder()` method.
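
To make the two parameters concrete, here is a minimal sketch of plain fixed-size, sliding-window chunking over characters. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers natural separators); the function name `sliding_window_chunks` is made up for this illustration.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks

sample_chunks = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(sample_chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, so text that straddles a boundary still appears intact in at least one chunk when the overlap is large enough.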

In [None]:
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=500, 
    chunk_overlap=20, 
    separators=["\n\n\n", "\n\n", "\n", "."],
)

docs_split = text_splitter.split_documents(docs)

Alternatively, you can also use `from_huggingface_tokenizer()` with a tokenizer of your choice, as shown here:
```python
from transformers import AutoTokenizer  # requires the `transformers` package

tokenizer = AutoTokenizer.from_pretrained('intfloat/e5-large-unsupervised')
text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(tokenizer, chunk_size=500, ...)
```

You might be asking why we set `chunk_size=500`. We keep `chunk_size` below 512 because the retriever model we use below supports a maximum input of 512 tokens. There is no magic `chunk_size` or `chunk_overlap` value that works for every document or dataset. You can run experiments on your own dataset to find out which values work for your use case.

Let's check out how many chunks (doc splits) we have now.

In [None]:
len(docs_split)

In [None]:
## Print the length (in characters) of each chunk
for d in docs_split:
    print(len(d.page_content))

In [None]:
docs_split[0]

By uncommenting the cell below you can check the number of tokens per chunked document split.

In [None]:
# import tiktoken
# encoding = tiktoken.get_encoding("gpt2")
# for d in docs_split:
#     print("The document is %s tokens" % len(encoding.encode(d.page_content)))

#### 2.1.3. Generating Query and Document Embeddings with a Dense Embedding Model

Open-domain question answering (Q&A) relies on efficient passage/document retrieval to select candidate contexts. [**`TF-IDF`** ](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) and [**`BM25`** ](https://en.wikipedia.org/wiki/Okapi_BM25) are traditional lexical search methods that do not use dense embeddings, i.e. dense representations of tokens or sentences. Dense passage retrieval fills this gap: it uses a dense encoder that maps any text passage to a `d`-dimensional real-valued vector. To learn more, you can read the paper [Dense Passage Retrieval for Open-Domain Question Answering](https://arxiv.org/pdf/2004.04906.pdf). An embedding model is a crucial component of a text retrieval system, as it transforms textual information into dense vector representations. Such models are usually transformer encoders that process the tokens of an input text (for example, a question or a passage) to output an embedding. 
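
For contrast with dense retrieval, the lexical side can be sketched in a few lines. The function below scores documents with the standard Okapi BM25 formula using whitespace tokenization; the function name `bm25_scores`, the tiny corpus, and the parameter values `k1=1.5` and `b=0.75` are assumptions chosen for illustration, not values taken from this notebook.

```python
import math
from collections import Counter

def bm25_scores(query: str, corpus: list, k1: float = 1.5, b: float = 0.75) -> list:
    """Score every document in corpus against query with Okapi BM25."""
    docs_tokens = [doc.lower().split() for doc in corpus]
    n_docs = len(docs_tokens)
    avg_len = sum(len(tokens) for tokens in docs_tokens) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in docs_tokens for term in set(tokens))
    scores = []
    for tokens in docs_tokens:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue  # terms absent from the document contribute nothing
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            tf = term_freq[term]
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(tokens) / avg_len))
        scores.append(score)
    return scores

lexical_corpus = [
    "guardrails keep llm applications safe",
    "vector databases store embeddings",
    "gtc is the nvidia ai conference",
]
lexical_scores = bm25_scores("llm guardrails", lexical_corpus)
print(lexical_scores)
```

Only documents that share exact terms with the query receive a non-zero score, which is the limitation that dense embeddings aim to address.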

We can use `bi-encoder` architectures to generate embeddings for queries and text passages separately. As the image shows, we feed the query and the chunked passages to different towers; once an embedding is generated for each query and each passage, we calculate the similarity (via cosine or dot product) between them. The score between a query and its most relevant passages should be higher than the score for irrelevant pairs.


<img src="./images/biencoder.png" width=400>
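
The scoring step of a bi-encoder can be sketched with plain Python. The three-dimensional vectors below are made up and simply stand in for the outputs of the query and passage towers; real embeddings have many more dimensions.

```python
import math

def dot(u, v):
    """Inner (dot) product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector norms."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Made-up 3-dimensional vectors standing in for encoder outputs.
query_vec = [0.9, 0.1, 0.0]
passage_vecs = {
    "relevant passage": [0.8, 0.2, 0.1],
    "irrelevant passage": [0.0, 0.1, 0.9],
}
pair_scores = {name: cosine(query_vec, vec) for name, vec in passage_vecs.items()}
print(pair_scores)
```

The passage whose vector points in nearly the same direction as the query vector gets a score close to 1, while the unrelated one scores close to 0.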

#### 2.1.3.1. Initialize the embedding model

[Embeddings](https://python.langchain.com/docs/modules/data_connection/text_embedding/) for documents are created by vectorizing the document text; this vectorization captures the semantic meaning of the text, which lets you quickly and efficiently find other pieces of text that are similar. Below, we use the `NVIDIA Retrieval QA Embedding Model`, which is part of **NVIDIA NeMo Retriever** and provides optimized, commercially ready options for a production-ready information retrieval pipeline with enterprise support.

The main requirement when initializing an embedding model is to provide the model name. An example is `nvolveqa_40k`; the cells below pass the name `ai-embed-qa-4`, so check `NVIDIAEmbeddings.get_available_models()` for the names your endpoint currently exposes. 
The NVIDIA Retrieval QA Embedding Model is a transformer encoder: a fine-tuned version of [E5-Large-Unsupervised](https://huggingface.co/intfloat/e5-large-unsupervised), with 24 layers and an embedding size of 1024, trained on private and public datasets as described in the Dataset and Training section of its model card. It supports a maximum input of 512 tokens.


For NVIDIA's retrieval embedding model `nvolveqa_40k`, you can also specify the `model_type` as `passage` or `query`. When doing retrieval, you will get the best results if you embed the source documents with the passage type and the user queries with the query type.

If not provided, the `embed_query` method will default to the query type, and the `embed_documents` method will default to the passage type.

In [None]:
NVIDIAEmbeddings.get_available_models()

In [None]:
document_embedder = NVIDIAEmbeddings(model="ai-embed-qa-4", model_type="passage")
query_embedder = NVIDIAEmbeddings(model="ai-embed-qa-4", model_type="query")

**Document Embedding**
- **Purpose**: Tailored for longer-form or response-like content, including document chunks or paragraphs.
- **Method**: Employs `embed_documents` for batch processing of documents.
- **Role in Retrieval**: Acts as the "value," representing the searchable content within the retrieval system.
- **Usage Pattern**: Typically embedded en masse as a pre-processing step, creating a repository of document embeddings for future querying.

**Query Embedding**
- **Purpose**: Designed for embedding shorter-form or question-like material, such as a simple statement or a question.
- **Method**: Utilizes `embed_query` for embedding each query individually.
- **Role in Retrieval**: Functions as the "key," facilitating the search or query process in a document retrieval framework.
- **Usage Pattern**: Embedded dynamically, as needed, for comparison against a pre-processed collection of document embeddings.


Let's do a little exercise and generate embeddings for a set of queries and their corresponding documents/passages.

In [None]:
# Example queries and documents
queries = [
    "What's the capital of Germany?",
    "What kinds of food is Italy known for?",
    "Who is the inventor of the World Wide Web?",
    "What's the marathon distance?",
    "What's the name of NVIDIA's famous AI Conference?"
]

documents = [
    "Berlin.",
    "Italy is famous for pasta, pizza, gelato, and espresso.",
    "Berners-Lee was honoured as the inventor of the WWW during the 2012 Summer Olympics opening ceremony",
    "it is 42.195 kilometers.",
    "It is the conference for the era of AI, or you can call it GTC for short."
]

In [None]:
%%time
# Embedding the queries
q_embeddings = [query_embedder.embed_query(query) for query in queries]

# Embedding the documents
d_embeddings = document_embedder.embed_documents(documents)

In [None]:
len(d_embeddings), len(d_embeddings[0])

In [None]:
min(d_embeddings[0]), max(d_embeddings[0])

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import cosine_similarity

def plot_cross_similarity_matrix(emb1, emb2):
    # Compute the similarity matrix between embeddings1 and embeddings2
    cross_similarity_matrix = cosine_similarity(np.array(emb1), np.array(emb2))

    # Plotting the cross-similarity matrix
    plt.imshow(cross_similarity_matrix, cmap='plasma', interpolation='nearest')
    plt.colorbar()
    plt.gca().invert_yaxis()
    plt.title("Cross-Similarity Matrix")
    plt.grid(True)

plt.figure(figsize=(8, 6))
plot_cross_similarity_matrix(q_embeddings, d_embeddings)
## Rows of the matrix (y-axis) are queries and columns (x-axis) are documents
plt.xlabel("Document Embeddings")
plt.ylabel("Query Embeddings")
plt.show()

#### 2.1.4. Store Embeddings in the Vector Store [*(Retrieval)*](https://python.langchain.com/docs/modules/data_connection/)

After this toy example showing how to generate document and query embeddings, let's move on with our chunked documents.

<img src="./images/rag_vectordb.png" width=800>

As the diagram explains, once we chunk our documents, the next step is to generate embeddings for each chunk using our document embedding model and then index these embeddings using a vector database.

When a user sends in their query, the query is embedded using the query embedding model. Note that here in the bi-encoder, we use the same embedding model for both the passage and query embedding routines. Once we create the embeddings both for queries and documents, then we can find semantically similar (relevant) documents to the user's query by applying a similarity metric.

Once the document embeddings are generated, they are stored in a vector store so that at query time we can:
1) Embed the user query <br>
2) Retrieve the embedding vectors that are most similar to the query embedding using a similarity score

A vector store takes care of storing the embedded data and performing a vector search. LangChain provides support for a [selection of vector stores](https://python.langchain.com/docs/integrations/vectorstores/), among which we will choose [FAISS](https://github.com/facebookresearch/faiss) for its simplicity. [Milvus](https://milvus.io/docs/integrate_with_langchain.md) is another alternative with strong scaling features and [NVIDIA RAPIDS RAFT acceleration](https://developer.nvidia.com/blog/accelerating-vector-search-using-gpu-powered-indexes-with-rapids-raft/) for larger-scale deployments, but FAISS is a good starting point for a single-user application.
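
Conceptually, the two jobs of a vector store (storing vectors and searching them) can be sketched as a brute-force in-memory class. `ToyVectorStore` is a hypothetical name for this illustration only; FAISS and other real vector stores use optimized index structures rather than a Python loop.

```python
import heapq

class ToyVectorStore:
    """Hypothetical in-memory store: keeps (vector, text) pairs and does brute-force search."""

    def __init__(self):
        self._vectors = []
        self._texts = []

    def add(self, vector, text):
        """Store one embedding together with the text chunk it represents."""
        self._vectors.append(list(vector))
        self._texts.append(text)

    def search(self, query_vector, k=2):
        """Return the k (score, text) pairs with the largest inner product."""
        scored = (
            (sum(q * x for q, x in zip(query_vector, vec)), text)
            for vec, text in zip(self._vectors, self._texts)
        )
        return heapq.nlargest(k, scored, key=lambda pair: pair[0])

store = ToyVectorStore()
store.add([1.0, 0.0], "chunk about guardrails")
store.add([0.0, 1.0], "chunk about healthcare")
store.add([0.7, 0.7], "chunk about both topics")
top_hits = store.search([1.0, 0.1], k=2)
print(top_hits)
```

The made-up two-dimensional vectors are only there to make the ranking easy to follow by hand.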

In [None]:
embedder = NVIDIAEmbeddings(model="ai-embed-qa-4")

In [None]:
from langchain_community.vectorstores.utils import (
    DistanceStrategy,
    maximal_marginal_relevance,
)

db = FAISS.from_documents(
    docs_split, 
    embedder, 
    distance_strategy=DistanceStrategy.MAX_INNER_PRODUCT,
)

Check out whether the index was trained on our data or not.

In [None]:
db.index.is_trained

Check out the embedding dimension. `1024` comes from the embedding model.

In [None]:
db.index.d

Print the total number of vectors added to the index. Since we have 12 chunked doc splits, we get 12 indexed vectors.

In [None]:
db.index.ntotal

There are some FAISS-specific methods like `similarity_search_with_score`, which returns not only the documents but also the distance or similarity score of the query to them. Let's return the two most similar document chunks by setting `k=2`. 

In [None]:
query = "What presentations are given about guardrails?"
docs_and_scores = db.similarity_search_with_score(query, k=2)

In [None]:
docs_and_scores[0]

Since we use `MAX_INNER_PRODUCT` as the distance strategy, a higher score is better: the value returned is a similarity score, not a distance. Note that if the embeddings are normalized, the maximum inner product corresponds to `cosine` similarity.
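
The relationship between inner product and cosine similarity is easy to check numerically. The sketch below uses two made-up vectors and helper names (`l2_normalize`, `inner_product`) that are not part of FAISS or LangChain.

```python
import math

def l2_normalize(v):
    """Scale a vector so that its Euclidean (L2) norm is 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def inner_product(u, v):
    return sum(a * b for a, b in zip(u, v))

u, v = [3.0, 4.0], [4.0, 3.0]

# Cosine similarity computed directly on the raw vectors.
cosine_uv = inner_product(u, v) / (
    math.sqrt(inner_product(u, u)) * math.sqrt(inner_product(v, v))
)
# Inner product computed after normalizing both vectors to unit length.
normalized_ip = inner_product(l2_normalize(u), l2_normalize(v))

print(inner_product(u, v))        # raw inner product: 24.0
print(cosine_uv, normalized_ip)   # both are 0.96 up to floating-point rounding
```

Without normalization, the raw inner product also grows with vector length, so longer vectors can outrank better-aligned ones.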

### 2.2. Retrieve

Next, the query embedding is used to search a vector database that retrieves a small number of the most relevant document chunks to the user’s query.

<img src="./images/embedder.png" width=800>

Let's first convert the vectorstore into a `Retriever` which will return relevant documents for a provided query input.

In [None]:
retriever = db.as_retriever(search_kwargs={"k": 2})

relevant_docs = retriever.invoke("What presentations are given about guardrails?")

Uncomment the cell below to see what kinds of docs were retrieved. 

In [None]:
# relevant_docs[0]

### 2.3. Use an LLM to generate a response for a user query

<img src="./images/rag_llm.png" width=800>

The last step of the RAG pipeline is to generate responses back to the users. At this point, we create an expanded prompt (with the retrieved top-k chunks from retrieval step) for the LLM to generate a relevant and accurate response. A few things to consider:

- The format and explicitness of the human-written instructions can affect the final response quality. It is also important to follow the chat templates recommended by the model's developers.
- Sending a large prompt to the LLM might hurt the final response accuracy due to the `lost in the middle` phenomenon (see the [paper](https://cs.stanford.edu/~nfliu/papers/lost-in-the-middle.arxiv2023.pdf) for more information). The paper reports that large language models are often better at retrieving and using information at the start or end of the input context.
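
One possible way to act on this observation, sketched here as an assumption rather than a technique prescribed by the paper or used later in this notebook, is to reorder the retrieved chunks so that the most relevant ones sit at the edges of the context and the least relevant ones end up in the middle. The function name `reorder_for_long_context` is hypothetical.

```python
def reorder_for_long_context(chunks_by_relevance: list) -> list:
    """Place the most relevant chunks at the edges and the least relevant in the middle.

    The input is assumed to be sorted from most to least relevant.
    """
    front, back = [], []
    for position, chunk in enumerate(chunks_by_relevance):
        if position % 2 == 0:
            front.append(chunk)   # ranks 1, 3, 5, ... fill the start of the context
        else:
            back.append(chunk)    # ranks 2, 4, ... will fill the end of the context
    return front + back[::-1]

reordered = reorder_for_long_context(["r1", "r2", "r3", "r4", "r5"])
print(reordered)  # ['r1', 'r3', 'r5', 'r4', 'r2']
```

Whether such a reordering helps depends on the model and the number of chunks, so it is worth measuring rather than assuming.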

In [None]:
prompt = ChatPromptTemplate.from_messages([
    (
        "system",
        "Answer solely based on the following context: {context}",
    ),
    ("user", "{question}"),
])

def print_return(d):
    print(repr(d))
    return d


model = ChatNVIDIA(model="ai-mistral-7b-instruct-v2", max_tokens=1024)

chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)

In [None]:
chain.invoke("what's the projected value of the market of Gen AI and LLMs in healthcare?")

In [None]:
chain.invoke("Who is the VP of NVIDIA Healthcare and Life Sciences?")

<br/>

## <font color="#76b900">**3. Evaluation of the RAG Pipeline**</font>

With LangChain and the out-of-the-box NVIDIA endpoints, it should be pretty easy (and maybe even *fun*) to build a proof-of-concept RAG pipeline. However, it's also important to evaluate our pipeline to make sure it's ready to go into production. Noting that the RAG pipeline consists of the `Retrieval` and `Generator` components, we need to consider what a proper evaluation might entail.

Given a user question, the retriever finds relevant passages from a corpus (i.e., the knowledge base that we chunk and feed to the embedding model) and the language model uses these retrieved passages to generate a response. 

**Why Do We Want To Evaluate RAG?**

- Make sure the retrieved chunks and generated responses are of high quality.
- Know where the bottlenecks are and how we can address them (retrieval? generator? both?).
- Compare different models to pick the best options for our use cases. 

What else do you think we can add to this list?

#### **What Can Be Evaluated?**

Let's consider what can be evaluated given the retrieved context, generated answer, the user query, and ground truth. The figure below shows the high-level RAG Triad as presented by [TruLens](https://www.trulens.org/trulens_eval/core_concepts_rag_triad/). We added a few more metrics to the figure that you might consider calculating.

<img src="./images/rag_triad.png" width=500>


Note that this Triad does not show metrics that require `ground_truth`. Such metrics can also be calculated, e.g. `answer_similarity`, which compares the `ground_truth` with the generated answer.

Let's do some exercises to evaluate the **Retrieval** and **Generator** components of our RAG pipeline. 

### 3.1. Evaluating the Retrieval component

Every retriever in a RAG pipeline depends on a semantic understanding of the raw text, so it is important to have a systematic evaluation process for choosing the right `retriever` model. As we have just seen, retrieval is the step that returns the relevant chunks for a given query; those chunks are then fed to the generator.

The best data to evaluate retrieval is your own. Ideally, you would want to build a clean and labeled evaluation dataset that best reflects what you would see in production. It is best to build a custom benchmark to evaluate the quality of different retrievers, which can vary greatly depending on the domain and task.

Without well-labeled evaluation data, many turn to popular benchmarks as proxies for evaluating retrievers. [MTEB](https://github.com/embeddings-benchmark/mteb) is a benchmark that spans 8 embedding tasks covering a total of 56 datasets and 112 languages. However, a crucial question is: `Do the datasets in these benchmarks truly represent our workload?`, because evaluating performance on irrelevant cases can lead to false confidence in your RAG pipelines.  

The type of data you encounter in production plays a crucial role in determining the relevance of academic benchmarks. We often talk about the quality of the model but underestimate the quality of the data. The `garbage in / garbage out` principle familiar from ML practice applies to RAG applications as well.

Additionally, the choice of metrics should be use-case dependent as well when evaluating your RAG system.

<img src="./images/metrics.png" width=500>

We can summarize some best practices when evaluating the Retrieval component as:
1. **Create your evaluation dataset:** The quality of the dataset is important. The evaluation/test dataset should reflect the domain that you want to use RAG for. 
Consider creating a dataset of realistic questions while curating the amount of ambiguity, divergence, and difficulty for your typical and edge use cases. <br> `Human-annotated` data is preferred, but producing it is slow and resource-intensive, whereas synthetic data generation using LLMs is an active research area with some promising results. 
2. **Define the accuracy metrics**: To assess the quality of our system, we need to define some metrics to either quantitatively or qualitatively measure the performance. Well known quantitative metrics can be classified as `rank-aware` and `rank-agnostic`. Note that `recall` and `precision` do not take the order of the returned documents into consideration, whereas `NDCG` and `MAP` are order-aware metrics that are suitable for document ranking.
3. **Define the evaluation tool:** In addition to writing your own custom code to evaluate the retrieval and generator, open-source libraries like [Ragas](https://github.com/explodinggradients/ragas), [TruLens](https://github.com/truera/trulens), [Phoenix Evals](https://docs.arize.com/phoenix/use-cases/rag-evaluation) can also be used to automate the process and provide useful starting points. Note that these libraries rely on a strong and consistent LLM which they use in an LLM-as-a-Judge style to evaluate your pipeline.
4. **Human-in-the-loop evaluation:** Although it is expensive and time consuming, it might still be very useful to incorporate human judgement in the evaluation process. This might be required in some edge cases where the generated results can create critical consequences, such as in the medical domain.

Although various metrics might be suitable for your specific use cases, we will focus on two relatively simple but popular metrics that you're likely to encounter in information retrieval: **`Recall`** and **`NDCG`**.

A rank-agnostic metric, `Recall`, measures the percentage of the relevant results retrieved:

$$Recall = \frac{Number\ of\ Relevant\ Items\ Retrieved}{Total\ Number\ of\ Relevant\ Items}$$

A rank-aware metric, `NDCG` (Normalized Discounted Cumulative Gain), rewards placing the most relevant results at the top of the list. Its advantage is that it can handle both binary and graded relevance. If you want to learn more about the mathematical formulas behind NDCG, this [page](https://en.wikipedia.org/wiki/Discounted_cumulative_gain) is a helpful resource. This [blog post](https://www.pinecone.io/learn/offline-evaluation/) also gives useful explanations with examples.
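
To make the difference between the two metrics concrete, here is a minimal pure-Python sketch. The function names, document ids, and relevance grades are made up for illustration, and the DCG uses the linear-gain form (grade divided by the log of the position); some definitions use `2^grade - 1` as the gain instead.

```python
import math
from typing import Dict, List


def recall_at_k(retrieved: List[str], relevance: Dict[str, int], k: int) -> float:
    """Fraction of all relevant documents that appear in the top-k results."""
    relevant = {doc for doc, grade in relevance.items() if grade > 0}
    if not relevant:
        return 0.0
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant)


def dcg(grades: List[int]) -> float:
    """Discounted cumulative gain: each grade is discounted by log2(position + 1)."""
    return sum(grade / math.log2(pos + 2) for pos, grade in enumerate(grades))


def ndcg_at_k(retrieved: List[str], relevance: Dict[str, int], k: int) -> float:
    """DCG of the top-k results divided by the DCG of an ideal ordering."""
    gains = [relevance.get(doc, 0) for doc in retrieved[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Graded relevance labels for one query (toy data).
relevance = {"d1": 2, "d2": 1, "d3": 1}

# Two retrievers return the same documents in a different order.
run_a = ["d1", "d2", "d9", "d3"]
run_b = ["d9", "d3", "d2", "d1"]

for name, run in [("run_a", run_a), ("run_b", run_b)]:
    print(name, recall_at_k(run, relevance, k=4), round(ndcg_at_k(run, relevance, k=4), 4))
```

Both runs retrieve every relevant document, so their recall is identical, but `run_a` places the most relevant document first and therefore gets the higher NDCG.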

### 3.2. Retrieval Evaluation with NVIDIA Embedding Microservice

In this section, we will perform retrieval evaluation using one of the [BEIR benchmark](https://github.com/beir-cellar/beir) datasets, [FiQA-2018](https://sites.google.com/view/fiqa/) (Financial Opinion Mining and Q&A). The [BEIR](https://github.com/beir-cellar/beir) benchmark has 18 publicly available datasets for diverse information retrieval tasks and domains. The datasets are geared toward measuring the performance of an embedding model in different applications, such as retrieval, clustering, and summarization. Given the focus on RAG, you should consider which performance metrics and which datasets are most useful for evaluating a Question-Answering (QA) retrieval solution aligned with your use case. For the retrieval task, `NDCG@10` is the metric commonly used to compare models. You can check out the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard) for the `Retrieval` task, which is our focus here.

The BEIR GitHub [repo](https://github.com/beir-cellar/beir) provides evaluation code for the BEIR benchmark datasets. These datasets come in a specific layout, which we will call the `beir format` for now. We adapted the code [here](https://github.com/beir-cellar/beir/blob/main/examples/retrieval/evaluation/custom/evaluate_custom_model.py) to create a custom class that runs the evaluation on a model with query and passage prefixes. Note that the BEIR evaluation does not apply chunking, since it evaluates retrieval on an already chunked corpus.

#### 3.2.1. NVIDIA Embedding Microservice

NVIDIA NeMo™ microservices is a collection of containerized software to easily and rapidly build and deploy large language model (LLM) workloads for enterprise use cases. As a semantic-retrieval microservice, [NeMo Retriever](https://developer.nvidia.com/nemo-microservices-early-access) helps generative AI applications provide more accurate responses through NVIDIA-optimized algorithms. Developers using the microservices can connect their AI applications to business data wherever it resides across clouds and data centers. It adds NVIDIA-optimized RAG capabilities to AI foundries and is part of the NVIDIA AI Enterprise software platform, available in AWS Marketplace.

The `NVIDIA NeMo Retriever Embedding Microservice` brings the power of state-of-the-art text embedding to your applications, providing unmatched natural language processing and understanding capabilities. Whether you’re developing semantic search, Retrieval Augmented Generation (RAG) pipelines—or any application that needs to use text embeddings—NeMo Retriever Embedding has you covered. Built on the NVIDIA software platform incorporating CUDA, TensorRT, and Triton, NeMo Retriever Embedding brings state of the art GPU accelerated Text Embedding model serving.

**NeMo Retriever Embedding uses NVIDIA’s [TensorRT](https://developer.nvidia.com/tensorrt) built on top of the [Triton Inference Server](https://developer.nvidia.com/triton-inference-server) for optimized inference of text embedding models**.

- **Scalable Deployment:** Whether you’re catering to a few users or millions, NeMo Retriever Embedding can be scaled seamlessly to meet your demands.

- **Flexible Integration:** Easily incorporate NeMo Retriever Embedding into existing workflows and applications, thanks to the OpenAI-compliant API endpoints.

- **Secure Processing:** Your data’s privacy is paramount. NeMo Retriever Embedding ensures that all inferences are processed securely, with rigorous data protection measures in place.


Let's check if the embedding microservice is running:

In [None]:
!curl -v http://embedding-ms:12345/v1/health/ready

Let's send a toy text snippet to the embedding microservice and get back an embedding vector for our text.

In [None]:
import base64
from typing import Dict, List

import numpy as np
import requests


def _embed(
    text: List[str],
    prefix: str = "passage",
    model_name: str = "",
) -> List[np.ndarray]:
    """Request embeddings for `text` from the embedding microservice."""
    headers: Dict[str, str] = {"Content-Type": "application/json"}
    payload = {
        "model": model_name,
        "input": text,
        "encoding_format": "base64",
        "input_type": prefix,  # "passage" for documents, "query" for questions
        "truncate": "END",
    }

    response = requests.post(
        "http://embedding-ms:12345/v1/embeddings/",
        headers=headers,
        json=payload,
        timeout=600,
    )

    if response.status_code != requests.codes.ok:
        msg = f"Error calling NeMo Retriever Embedding Microservice: \n{response.text}"
        raise Exception(msg)

    response_data = response.json()["data"]

    # Each embedding arrives as base64-encoded float32 bytes; decode them in index order.
    embeddings = [
        np.frombuffer(base64.b64decode(emb["embedding"]), dtype=np.float32)
        for emb in sorted(response_data, key=lambda o: o["index"])
    ]

    return embeddings


In [None]:
embeddings_toy = _embed(["I am attending NVIDIA GTC this year!"], model_name='nvolveqa')
print(embeddings_toy)

Let's check out the dimension.

In [None]:
embeddings_toy[0].shape

**Generate Embeddings for Beir dataset**

The following command calculates embeddings for the `FiQA` dataset corpus and queries using our `nvolve40k` model and saves them to disk as NumPy arrays. The script first downloads and unzips one of the smaller BEIR benchmark datasets, [FiQA-2018](https://sites.google.com/view/fiqa/), and then it runs evaluation on it using the `nvolve40k` retriever model. To generate the embeddings, it sends requests to the running `NVIDIA Embedding Microservice` and collects the responses.

In [None]:
!python ./beir/extract_embeddings.py

**Perform Evaluation** 

Once the embeddings are generated, we can perform the evaluation using our custom evaluation code.

In [None]:
!python ./beir/run_beir_benchmark.py 

Since `beir` uses `pytrec_eval` to evaluate models, you can visit the [pytrec_eval library](https://github.com/cvangysel/pytrec_eval) to see, in a toy example, how the BEIR evaluation metrics are calculated.

**Summary:**

In the section above, we evaluated the Retrieval component of the RAG pipeline using a custom BEIR evaluation script and generated the `Recall@k` and `NDCG@k` metrics. 
If your retriever model yields a high `Recall@k` but a low `NDCG@k`, it is selecting the right documents but not returning them in the desired order. `NDCG@k` is an important metric since we want the most relevant documents to be ranked in higher positions. To boost your NDCG metric, you can add a `reranker` model to your pipeline to rerank the fetched relevant documents. You can explore the NVIDIA Reranking Microservice for that purpose.

However, if the recall metric is low, the retriever is not doing a good job of finding the relevant documents. In that case, you might want to:
- Choose a stronger retriever model.
- Use a hybrid retrieval approach (for example, a combination of lexical and dense retriever models).
- Fine-tune your retriever model with your custom data or with a publicly available dataset that is representative of your data.
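
As an illustration of the hybrid option, the sketch below merges a lexical ranking and a dense ranking with reciprocal rank fusion (RRF). RRF is one common fusion technique, chosen here only as an example and not something this notebook uses. The function name and document ids are made up, and `k=60` is the constant commonly used with RRF, not a tuned value.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores 1 / (k + rank) for every list it appears in,
    with ranks starting at 1. Higher total score means a better fused rank.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by document id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Toy rankings (made up): one from a lexical retriever, one from a dense retriever.
lexical = ["d3", "d1", "d7"]
dense = ["d1", "d5", "d3"]

fused = reciprocal_rank_fusion([lexical, dense])
print(fused)
```

Documents that both retrievers rank highly rise to the top, while documents found by only one retriever are kept but ranked lower.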

### 3.3. RAG Evaluation with Ragas

We learned how we can calculate offline metrics for the document retrieval step, but what about evaluating the quality (accuracy, relevancy, faithfulness, etc.) of the final response generated by the Generator? Traditional metrics such as `F1 score` and `Exact match (EM)` may not fully capture the quality of responses generated by the Generator component, i.e. the LLM.

The `LLMs evaluate LLMs` paradigm has become a common practice: it leverages a powerful LLM to generate proxy targets based on some context. In the case of our QA bot, we can ask an LLM to generate question-answer pairs. One thing to pay attention to is prompting the LLM correctly, since LLMs can be sensitive to the prompt template.

There are some open source libraries, such as [Ragas](https://github.com/explodinggradients/ragas), [TruLens](https://github.com/truera/trulens), and [Phoenix Evals](https://docs.arize.com/phoenix/use-cases/rag-evaluation), that automate the evaluation process of RAG systems. 

[Ragas](https://github.com/explodinggradients/ragas) is a framework built to help users evaluate their RAG pipelines. In this tutorial, we demonstrate how to generate metrics using Ragas and LangChain. Ragas provides metrics for both the Retrieval and Generator components. Let's calculate two of the Generator metrics, `AnswerSimilarity` and `Faithfulness`, for our sample dataset below and try to understand how they work under the hood.

The following metrics can be used for evaluating the Generator component: <br>

- [Answer Similarity](https://docs.ragas.io/en/latest/concepts/metrics/semantic_similarity.html): Scores the semantic similarity of the ground truth with the generated answer. <br>
- [Faithfulness](https://docs.ragas.io/en/latest/concepts/metrics/faithfulness.html): Measures the factual consistency of the generated answer against the given context. It is calculated from the answer and the retrieved context. The score is scaled to the (0, 1) range; the higher, the better.
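
The arithmetic behind these two scores is simple once the model outputs are available. The sketch below is not the Ragas implementation. It assumes that answer similarity is the cosine similarity between the embeddings of the ground truth and the answer, and that faithfulness is the fraction of answer statements that the judge LLM marks as supported by the context. The vectors and verdicts are made-up stand-ins for what the embedding model and the judge LLM would return.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def faithfulness_score(verdicts: List[int]) -> float:
    """Fraction of answer statements judged as supported (1) by the context.

    Returning 0.0 for an empty list is a choice made for this sketch.
    """
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)


# Made-up embeddings for a ground-truth answer and a generated answer.
ground_truth_vec = [1.0, 0.0, 1.0]
answer_vec = [1.0, 1.0, 1.0]
print("answer similarity:", round(cosine_similarity(ground_truth_vec, answer_vec), 4))

# Made-up judge verdicts: the answer was split into four statements,
# three of which were found to be supported by the retrieved context.
verdicts = [1, 1, 0, 1]
print("faithfulness:", faithfulness_score(verdicts))
```

The expensive and error-prone part is producing the inputs: splitting the answer into statements and judging each one both depend on the judge LLM following the prompt format.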

**Preparing our dataset for Ragas**

Ragas requires datasets to be in a specific format before it can evaluate them. The data preparation guide can be found [here](https://docs.ragas.io/en/stable/howtos/applications/data_preparation.html). In this exercise, we use a sample `ragas_eval` dataset called [explodinggradients/fiqa](https://huggingface.co/datasets/explodinggradients/fiqa/viewer/ragas_eval), which is available on Hugging Face. We have already pulled in the parquet version of the dataset, so let's take a quick look: 

In [None]:
from datasets import load_dataset

dataset = load_dataset("explodinggradients/fiqa", "ragas_eval")

In [None]:
# the type of the dataset is DatasetDict
print(dataset)

For the next part, we'll need to convert it to a pandas dataframe: 

In [None]:
import pandas as pd

dataset = dataset["baseline"].to_pandas()

In [None]:
dataset.head()

In [None]:
dataset.shape

This sample dataset has 30 rows, i.e. 30 questions with the corresponding `ground_truths`, `answer`, and `contexts` columns. It is a sample drawn from the FiQA-2018 dataset.

Recent versions of Ragas expect the `ground_truths` column to be named `ground_truth`. Let's create the `ground_truth` column and drop `ground_truths`.

In [None]:
dataset["ground_truth"] = dataset["ground_truths"].map(lambda x: x[0])
dataset.drop(columns=['ground_truths'], inplace=True)

In [None]:
dataset.head(2)

<font color='blue'>**Note that** </font> here we are going to use the `answer` column coming from this dataset but in real-life we should generate our own responses (answers) from our RAG pipeline and then calculate the metrics for generator component using any evaluation tool. This can be a good exercise for you for later. You can actually try to generate answers for each question in this dataset and then rerun the steps below again. 

Convert the dataset to a `Dataset` object to perform evaluation with `Ragas`.

In [None]:
from datasets import Dataset
eval_dataset = Dataset.from_pandas(dataset)

In [None]:
eval_dataset[0]

In [None]:
eval_dataset.features

**Calculate the answer similarity**

In [None]:
## Useful RAGAS Imports
from ragas import evaluate
from ragas.metrics import (
    Faithfulness,
    AnswerRelevancy,
    AnswerCorrectness,
    AnswerSimilarity
)

# Support Utilities to connect to LangChain
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper

In [None]:
## Replace the default LLM (gpt-3.5-turbo-16k) with another model
llm = LangchainLLMWrapper(ChatNVIDIA(model="ai-mistral-7b-instruct-v2", max_tokens=1024))

## We can use our own embedding model for similarity calculations.
embedder = LangchainEmbeddingsWrapper(NVIDIAEmbeddings(model="ai-embed-qa-4"))

In [None]:
answer_similarity = AnswerSimilarity(llm=llm, embeddings=embedder)
ans_sim = evaluate(
    eval_dataset,
    metrics=[answer_similarity],
    llm = llm,
    embeddings=embedder
)

Print the average `answer_similarity` metric over all queries.

In [None]:
print(ans_sim)

In [None]:
df_answer_sim = ans_sim.to_pandas()
df_answer_sim.head()

**Calculate the faithfulness**

In [None]:
faith_mtr = Faithfulness(llm = llm)

In [None]:
faith_mtr.long_form_answer_prompt.instruction = ("Create one or more statements from each sentence in the given answer."
    " Response MUST begin with square [ + curly bracket and end in curly + square ] bracket!"
)
faith_mtr.nli_statements_message.instruction = (
    "Natural language inference. Use only 'Yes' (1) and 'No' (0) as verdict."
    " Response must begin with square bracket!"
)

We can print out the few-shot examples that are already given in the Ragas source code.

In [None]:
faith_mtr.long_form_answer_prompt.__dict__

In [None]:
#faith_mtr.nli_statements_message.__dict__

In [None]:
results = evaluate(
    eval_dataset, 
    metrics=[faith_mtr],
    llm = llm,
    embeddings=embedder
)

In [None]:
results

For better readability we can convert the results to a pandas dataframe.

In [None]:
df_faith = results.to_pandas()
df_faith.head()

One further exercise you can do is to check out the samples with low scores. You can examine whether there are any noisy samples, which may indicate a need for further data curation. You can also dive into [the source code](https://github.com/explodinggradients/ragas/blob/main/src/ragas/metrics/_faithfulness.py) to understand how an LLM can be used as a judge to calculate a given accuracy metric.

<br/>

## <font color="#76b900">**Summary**</font>

That's it! You have done a great job by finishing this tutorial! 

Though we covered quite a few pieces of the RAG pipeline, there are still more components we could have incorporated. Some interesting further efforts might include:
- `Hybrid Search (lexical + dense retrievers)`
- `Guardrails`
- `Post-retrieval optimization (e.g. reranking)` to create an Advanced RAG pipeline. 

[NVIDIA NeMo framework](https://www.nvidia.com/en-us/ai-data-science/generative-ai/nemo-framework/) provides these functionalities to build, customize, and deploy generative AI models anywhere via its  microservices and open source libraries. Check out the resources below to learn more, and enjoy the rest of the conference! :D

<br/>

## <font color="#76b900">**Resources**</font>

[1] NVIDIA Retrieval QA Embedding on [NGC AI Playground](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-foundation/models/nvolve-40k).<br>
[2] Build Enterprise Retrieval-Augmented Generation Apps with NVIDIA Retrieval QA Embedding Model, NVIDIA Technical [blog post](https://developer.nvidia.com/blog/build-enterprise-retrieval-augmented-generation-apps-with-nvidia-retrieval-qa-embedding-model/).<br>
[3] Evaluating Retriever for Enterprise-Grade RAG, NVIDIA Technical [blog post](https://developer.nvidia.com/blog/evaluating-retriever-for-enterprise-grade-rag/). <br>
[4] Prompt Engineering with LLaMA-2, DLI [course](https://courses.nvidia.com/courses/course-v1:DLI+S-FX-12+V1/).<br>
[5] Rapid Application Development with Large Language Models (LLMs), DLI [course](https://courses.nvidia.com/courses/course-v1:DLI+C-FX-09+V1/).<br>
[6] Model Parallelism: Building and Deploying Large Neural Networks, DLI [course](https://www.nvidia.com/en-us/training/instructor-led-workshops/model-parallelism-build-deploy-large-neural-networks/).<br>
[7] [NVIDIA NeMo Microservices](https://developer.nvidia.com/nemo-microservices-early-access), which offers microservice spinup routines that can be deployed on local compute and function similar to AI Playground.<br> 
[8] [NVIDIA AI Workbench](https://www.nvidia.com/en-us/deep-learning-ai/solutions/data-science/workbench/) allows for quick and simple model customization workflows that can greatly improve RAG model components for your specific use-cases.<br>
[9] [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) is the current recommended framework for deploying GPU-accelerated LLM model engines in production settings.<br>
[10] NVIDIA Generative AI Examples [Repo](https://github.com/NVIDIA/GenerativeAIExamples).<br>
[11] [NVIDIA AI Enterprise](https://www.nvidia.com/en-us/data-center/products/ai-enterprise/), The “Operating System” for Enterprise AI.