# Query Transformations: MultiQueryRetriever

Query transformations are a set of approaches that rewrite or otherwise modify a question before it is used for retrieval.

This notebook focuses on the multi-query retriever, which uses an LLM to generate several variants of the original query and then runs retrieval with each variant.
The retrieved documents are combined into a single set (their union), and a reranking model scores them so that the passage with the highest ranking score can be selected.
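
The flow described above can be sketched in plain Python. This is only an illustration under stated assumptions: `multi_query_retrieve` and the `toy_*` helpers are hypothetical stand-ins for the LLM, the vector store, and the reranker used later in this notebook, and the scoring is simple word overlap rather than a real model.

```python
# Hypothetical toy corpus standing in for the indexed blog chunks.
CORPUS = [
    "Triton autoscaling with Kubernetes HPA",
    "MIG lets Kubernetes share one GPU",
    "Riva speech AI deployment",
]


def toy_generate(question):
    """Stand-in for the LLM: return the question plus one rephrasing."""
    return [question, question.replace("autoscale", "scale automatically")]


def toy_retrieve(query):
    """Stand-in for the retriever: passages sharing at least one word with the query."""
    words = set(query.lower().split())
    return [p for p in CORPUS if words & set(p.lower().split())]


def toy_score(question, passage):
    """Stand-in for the reranker: count of shared words."""
    return len(set(question.lower().split()) & set(passage.lower().split()))


def multi_query_retrieve(question, generate_queries, retrieve, score, top_n=5):
    """Generate query variants, take the unique union of results, then rerank."""
    seen = set()
    union = []
    for query in generate_queries(question):
        for passage in retrieve(query):
            if passage not in seen:  # keep each passage only once
                seen.add(passage)
                union.append(passage)
    # Rank the union against the original question, best first.
    ranked = sorted(union, key=lambda p: score(question, p), reverse=True)
    return ranked[:top_n]


results = multi_query_retrieve(
    "How to autoscale Triton on Kubernetes", toy_generate, toy_retrieve, toy_score
)
print(results)
```

The rest of the notebook replaces each stand-in with a hosted model: an LLM for query generation, an embedding-based retriever, and a reranking model.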

For comparison, the notebook first builds a simple RAG pipeline that uses a single query.

You don't need to download any models: the LLM, embedding, and reranking models are hosted at NVIDIA endpoints (https://build.nvidia.com/). You need to generate an NVIDIA API key to use the model endpoints.

The notebook also uses LangChain's MultiQueryRetriever and enables LangSmith tracing, for which you need a LangChain API key.


## Environment: Packages and API keys

In [1]:
! pip install -qU nest_asyncio beautifulsoup4 langchain_community tiktoken langchainhub chromadb langchain langchain-nvidia-ai-endpoints

In [2]:
import os
# LangSmith tracing (uses the LangChain API key)
os.environ['LANGCHAIN_TRACING_V2'] = 'true'
os.environ['LANGCHAIN_ENDPOINT'] = 'https://api.smith.langchain.com'
os.environ['LANGCHAIN_API_KEY'] = 'your LangChain API key'
# NVIDIA API key for the hosted LLM / embedding / reranking endpoints
os.environ["NVIDIA_API_KEY"] = 'your NVIDIA API key starting with nvapi-'
os.environ['USER_AGENT'] = 'myagent'
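
As an alternative to pasting keys into the notebook, the keys can be read interactively when they are not already set. The sketch below is one way to do that with the standard library; `ensure_env` is a hypothetical helper name, not part of LangChain or the NVIDIA packages.

```python
import getpass
import os


def ensure_env(name, prompt):
    """Return the value of environment variable `name`, prompting only if it is unset."""
    if not os.environ.get(name):
        # getpass hides the typed value so the key does not end up in the notebook output.
        os.environ[name] = getpass.getpass(prompt)
    return os.environ[name]
```

For example, calling `ensure_env("NVIDIA_API_KEY", "NVIDIA API key (starts with nvapi-): ")` before creating any model client leaves an existing value untouched and prompts otherwise.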

## A Simple RAG 

In [3]:
#### INDEXING ####

# Load multiple blogs concurrently: https://python.langchain.com/v0.2/docs/integrations/document_loaders/web_base/
import bs4
import nest_asyncio
from langchain_community.document_loaders import WebBaseLoader

nest_asyncio.apply()

# Three NVIDIA technical blog posts to index
urls = [
    "https://developer.nvidia.com/blog/autoscaling-nvidia-riva-deployment-with-kubernetes-for-speech-ai-in-production/",
    "https://developer.nvidia.com/blog/deploying-nvidia-triton-at-scale-with-mig-and-kubernetes/",
    "https://developer.nvidia.com/blog/getting-kubernetes-ready-for-the-a100-gpu-with-multi-instance-gpu/",
]
loader = WebBaseLoader(urls)
loader.requests_per_second = 1  # throttle concurrent requests
docs = loader.aload()

# Split
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=300, 
    chunk_overlap=50)

# Make splits
splits = text_splitter.split_documents(docs)


# Post-processing
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


Fetching pages: 100%|##########| 3/3 [00:00<00:00,  4.96it/s]
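
The splitter above is configured with `chunk_size=300` and `chunk_overlap=50`, both measured in tiktoken tokens. As a rough illustration of what the overlap means, the sketch below cuts a token list into fixed-size windows, each sharing `chunk_overlap` tokens with its neighbour. This is a simplification: `RecursiveCharacterTextSplitter` also tries to split on natural separators, which the hypothetical `sliding_windows` helper does not attempt.

```python
def sliding_windows(tokens, chunk_size, chunk_overlap):
    """Split `tokens` into windows of `chunk_size` that overlap by `chunk_overlap`."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the input
    return windows


# Toy input: ten integer "tokens", windows of 4 with an overlap of 1.
windows = sliding_windows(list(range(10)), chunk_size=4, chunk_overlap=1)
print(windows)
```

The overlap means that a sentence falling on a chunk boundary is still seen whole in at least one chunk, at the cost of indexing some tokens twice.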


In [4]:
# Index
from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings
from langchain_community.vectorstores import Chroma

embedding_model = NVIDIAEmbeddings(model="NV-Embed-QA", truncate="NONE")
vectorstore = Chroma.from_documents(documents=splits, embedding=embedding_model)

retriever = vectorstore.as_retriever()
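
Conceptually, the retriever embeds the question and returns the stored chunks whose vectors are closest to it. The sketch below shows that idea with hand-written 2-D vectors and cosine similarity. It is an assumption-laden toy: the actual distance metric and the number of results returned depend on how the vector store is configured, and `cosine` and `top_k` are hypothetical helper names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k=4):
    """Return the indices of the k document vectors most similar to the query."""
    scored = sorted(
        ((cosine(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)),
        reverse=True,
    )
    return [idx for _, idx in scored[:k]]


# Toy embeddings: document 0 points almost the same way as the query.
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
nearest = top_k([1.0, 0.1], doc_vecs, k=2)
print(nearest)
```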

In [5]:
#### RETRIEVAL and GENERATION ####

# Prompt
from langchain import hub
prompt = hub.pull("rlm/rag-prompt")

# LLM
from langchain_nvidia_ai_endpoints import ChatNVIDIA
llm = ChatNVIDIA(
  model="meta/llama-3.1-8b-instruct",
  temperature=0.2,
  top_p=0.7,
  max_tokens=1024,
)


# Chain
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Question
rag_chain.invoke("How to autoscale NVIDIA Triton to multiple GPUs using Kubernetes?")



'To autoscale NVIDIA Triton to multiple GPUs using Kubernetes, you can use a Horizontal Pod Autoscaler (HPA) to scale the number of Triton Inference Servers based on the number of inference requests. You can create a PodMonitor to collect NVIDIA Triton metrics and use PromQL to query the metrics from Prometheus. Then, you can define a custom metric in the HPA file to trigger autoscaling.'
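
To make the data flow of the chain above easier to follow, here is the same sequence written as ordinary function calls. Everything prefixed with `toy_` is a hypothetical stand-in (no model or vector store is called), so this only mirrors the order of the steps: retrieve, format the context, fill the prompt, call the LLM.

```python
def format_docs(passages):
    """Join retrieved passages into one context string."""
    return "\n\n".join(passages)


def toy_retriever(question):
    """Stand-in for the vector store retriever."""
    return ["Use a Horizontal Pod Autoscaler.", "Collect Triton metrics with Prometheus."]


def toy_prompt(inputs):
    """Stand-in for the RAG prompt template: fill in context and question."""
    return f"Context:\n{inputs['context']}\n\nQuestion: {inputs['question']}\nAnswer:"


def toy_llm(prompt_text):
    """Stand-in for the chat model: echo the prompt it received."""
    return f"[model output for prompt]\n{prompt_text}"


def toy_rag_chain(question):
    # The dict step: the question goes to the retriever and is also passed through as-is.
    inputs = {"context": format_docs(toy_retriever(question)), "question": question}
    return toy_llm(toy_prompt(inputs))


answer = toy_rag_chain("How to autoscale NVIDIA Triton?")
print(answer)
```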

## Multi Query Retriever


In [6]:
# Follows the docs: https://python.langchain.com/v0.1/docs/modules/data_connection/retrievers/MultiQueryRetriever/
from langchain.retrievers.multi_query import MultiQueryRetriever

question = "How to autoscale NVIDIA Triton to multiple GPUs using Kubernetes?"

retriever_from_llm = MultiQueryRetriever.from_llm(
    retriever=retriever, llm=llm
)


### Multiple Queries

In [7]:
# Set logging for the queries
import logging

logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

retrieved_docs = retriever_from_llm.invoke(question)
len(retrieved_docs)

INFO:langchain.retrievers.multi_query:Generated queries: ['Here are three alternative versions of the user question to retrieve relevant documents from a vector database:', 'How to configure NVIDIA Triton to automatically scale across multiple GPUs in a Kubernetes environment?', 'What are the best practices for deploying NVIDIA Triton on a Kubernetes cluster with multiple GPUs, and how can I ensure efficient autoscaling?', 'How can I use Kubernetes to dynamically allocate and manage multiple NVIDIA GPUs for NVIDIA Triton inference, and what are the key considerations for autoscaling in this setup?']


11
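
In the log above, the first entry in the generated queries is the model's introductory sentence rather than a question, so it appears to be used as a retrieval query alongside the three real variants. One possible clean-up, sketched below, keeps only lines that end with a question mark. `keep_question_lines` is a hypothetical helper, not part of LangChain, and wiring it into the retriever is not shown here.

```python
import re


def keep_question_lines(lines):
    """Drop preamble lines and list markers, keeping only lines that look like questions."""
    questions = []
    for line in lines:
        # Remove leading numbering or bullets such as "1. ", "2) ", "- ".
        text = re.sub(r"^(?:\d+[.)]|[-*])\s*", "", line.strip())
        if text.endswith("?"):
            questions.append(text)
    return questions


generated = [
    "Here are three alternative versions of the user question to retrieve relevant documents from a vector database:",
    "How to configure NVIDIA Triton to automatically scale across multiple GPUs in a Kubernetes environment?",
    "1. What are the best practices for deploying NVIDIA Triton on a Kubernetes cluster with multiple GPUs?",
]
filtered = keep_question_lines(generated)
print(filtered)
```

A filter this simple would also drop a legitimate variant that is phrased as a statement, so it is a heuristic rather than a general fix.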

In [8]:
retrieved_docs

[Document(metadata={'description': 'Multi-Instance GPU (MIG) is a new feature of the latest generation of NVIDIA GPUs, such as A100. It enables users to maximize the utilization of a single GPU by…', 'language': 'en-US', 'source': 'https://developer.nvidia.com/blog/getting-kubernetes-ready-for-the-a100-gpu-with-multi-instance-gpu/', 'title': 'Getting Kubernetes Ready for the NVIDIA A100 GPU with Multi-Instance GPU | NVIDIA Technical Blog'}, page_content='"nvidia.com/gpu.memory": "40537",\n\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0 "nvidia.com/gpu.product": "A100-SXM4-40GB",\n\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0 "nvidia.com/mig-1g.5gb.count": "2",\n\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0 "nvidia.com/mig-1g.5gb.engines.copy": "1",\n\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0 "nvidia.com/mig-1g.5gb.engines.decoder": "0",\n\xa0\xa0\xa0\xa0\xa0\xa0\

## Reranker
After retrieving documents for each of the generated queries, a reranker scores every retrieved passage against the original question, and the passage with the highest relevance score is picked as the answer:


In [9]:
from langchain_nvidia_ai_endpoints import NVIDIARerank
from langchain_core.documents import Document

query = "How to autoscale NVIDIA Triton to multiple GPUs using Kubernetes?"

passages = [doc.page_content for doc in retrieved_docs]

client = NVIDIARerank(
  model="nvidia/nv-rerankqa-mistral-4b-v3", 
  truncate="END",
  top_n=5
)

response = client.compress_documents(
  query=query,
  documents=[Document(page_content=passage) for passage in passages]   
)
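
The cell above rebuilds each `Document` from `page_content` only, so the reranked results carry a relevance score but no longer record which blog post a passage came from. The sketch below shows one pattern for attaching a score while keeping the original metadata. It uses a stand-in `Doc` dataclass and a toy word-overlap score, both hypothetical, rather than `NVIDIARerank` itself.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for a document with text and metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def toy_score(query, passage):
    """Toy relevance score: number of words shared by query and passage."""
    return len(set(query.lower().split()) & set(passage.lower().split()))


def rerank_keep_metadata(query, docs, top_n=5):
    """Score each document, keep its metadata, and return the top_n by score."""
    scored = []
    for doc in docs:
        metadata = dict(doc.metadata)  # copy so the input documents are not mutated
        metadata["relevance_score"] = toy_score(query, doc.page_content)
        scored.append(Doc(doc.page_content, metadata))
    scored.sort(key=lambda d: d.metadata["relevance_score"], reverse=True)
    return scored[:top_n]


docs_in = [
    Doc("MIG partitions one GPU", {"source": "mig-blog"}),
    Doc("autoscale Triton using Kubernetes", {"source": "triton-blog"}),
]
ranked = rerank_keep_metadata("How to autoscale Triton using Kubernetes", docs_in, top_n=1)
print(ranked[0].metadata)
```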


In [10]:
print(f"Most relevant: {response[0].page_content}\n")

Most relevant: In this post, we share the following best practices:
Deploying multiple Triton Inference Servers in parallel using MIG on A100Autoscaling the number of Triton Inference Servers based on the number of inference requests using Kubernetes and Prometheus monitoring stack.Using the NGINX Plus load balancer to distribute the inference load evenly among different Triton Inference Servers.
This idea can be applied to multiple A100 or A30 GPUs on a single node or multiple nodes for autoscaling NVIDIA Triton deployment in production. For example, a DGX A100 allows up to 56 Triton Inference Servers (each A100 having up to seven servers using MIG) running on Kubernetes Pods.



In [11]:
#print(f"Least relevant: {response[-1].page_content}\n")

In [12]:
response

[Document(metadata={'relevance_score': 27.171875}, page_content='In this post, we share the following best practices:\nDeploying multiple Triton Inference Servers in parallel using MIG on A100Autoscaling the number of Triton Inference Servers based on the number of inference requests using Kubernetes and Prometheus monitoring stack.Using the NGINX Plus load balancer to distribute the inference load evenly among different Triton Inference Servers.\nThis idea can be applied to multiple A100 or A30 GPUs on a single node or multiple nodes for autoscaling NVIDIA Triton deployment in production. For example, a DGX A100 allows up to 56 Triton Inference Servers (each A100 having up to seven servers using MIG) running on Kubernetes Pods.'),
 Document(metadata={'relevance_score': 17.671875}, page_content='Figure 1. (left) Clients sending inference requests to Triton Inference Servers running on MIG devices in Kubernetes. (right) The client getting classification results and performance numbers.\

## References

NVIDIA: https://build.nvidia.com/explore/discover

LangChain: https://github.com/langchain-ai/rag-from-scratch/tree/main

LangChain MultiQueryRetriever: https://python.langchain.com/docs/modules/data_connection/retrievers/MultiQueryRetriever

LangSmith: https://docs.smith.langchain.com/