# NVIDIA NIMs

The `langchain-nvidia-ai-endpoints` package contains LangChain integrations for chat models and embeddings powered by [NVIDIA AI Foundation Models](https://www.nvidia.com/en-us/ai-data-science/foundation-models/), and hosted on the [NVIDIA API Catalog](https://build.nvidia.com/).

NVIDIA AI Foundation models are community- and NVIDIA-built models that are optimized to deliver the best performance on NVIDIA-accelerated infrastructure. 
You can use the API to query live endpoints that are available on the NVIDIA API Catalog to get quick results from a DGX-hosted cloud compute environment, 
or you can download models from NVIDIA's API catalog with NVIDIA NIM, which is included with the NVIDIA AI Enterprise license. 
The ability to run models on-premises gives your enterprise ownership of your customizations and full control of your IP and AI application. 

NIM microservices are packaged as container images on a per model/model family basis 
and are distributed as NGC container images through the [NVIDIA NGC Catalog](https://catalog.ngc.nvidia.com/). 
At their core, NIM microservices are containers that provide interactive APIs for running inference on an AI Model. 

Use this documentation to learn how to install the `langchain-nvidia-ai-endpoints` package and use it to explore the `NVIDIARerank` and `NVIDIAEmbeddings` classes. This notebook demonstrates how you can use a re-ranking model to combine retrieval results and improve accuracy during retrieval of documents.

## Install the Package

In [None]:
%pip install --upgrade --quiet langchain-nvidia-ai-endpoints

## Access the NVIDIA API Catalog

To get access to the NVIDIA API Catalog, do the following:

1. Create a free account on the [NVIDIA API Catalog](https://build.nvidia.com/) and log in.
2. Click your profile icon, and then click **API Keys**. The **API Keys** page appears.
3. Click **Generate API Key**. The **Generate API Key** window appears.
4. Click **Generate Key**. You should see **API Key Granted**, and your key appears.
5. Copy and save the key as `NVIDIA_API_KEY`.
6. To store the key in your environment and check that it has the expected format, use the following code.

In [None]:
import getpass
import os

# del os.environ['NVIDIA_API_KEY']  ## delete key and reset
if os.environ.get("NVIDIA_API_KEY", "").startswith("nvapi-"):
    print("Valid NVIDIA_API_KEY already in environment. Delete to reset")
else:
    nvapi_key = getpass.getpass("NVAPI Key (starts with nvapi-): ")
    assert nvapi_key.startswith("nvapi-"), f"{nvapi_key[:5]}... is not a valid key"
    os.environ["NVIDIA_API_KEY"] = nvapi_key

You can now use your key to access endpoints on the NVIDIA API Catalog.
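Note that the cell above only checks the `nvapi-` prefix; it does not confirm that the service accepts the key. If you want to reuse that format check elsewhere in an application, a minimal sketch could look like the following. The helper name `has_valid_nvidia_key` is hypothetical and is not part of the `langchain-nvidia-ai-endpoints` package.

```python
import os


def has_valid_nvidia_key(env=None):
    """Return True if NVIDIA_API_KEY is set and starts with 'nvapi-'.

    Hypothetical helper: this is a format check only. It does not
    contact the NVIDIA API Catalog, so a revoked key still passes.
    """
    env = os.environ if env is None else env
    return env.get("NVIDIA_API_KEY", "").startswith("nvapi-")
```

Passing a plain dictionary as `env` makes the check easy to exercise without touching the real environment.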

## Work with the API Catalog

To test your connection to the API catalog, submit a query to the [nv-embedqa-e5-v5](https://build.nvidia.com/nvidia/nv-embedqa-e5-v5/modelcard) model by running the following code. If you don't specify a model, the embedder uses the default model.

In [None]:
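from langchain_nvidia_ai_endpoints import ChatNVIDIA, NVIDIAEmbeddings, NVIDIARerank

embedder = NVIDIAEmbeddings(model="nvidia/nv-embedqa-e5-v5")

# Uncomment to list the models that are available to the embedder.
# embedder.get_available_models()

# Submit a test query. The result is a list of floats; print its length.
len(embedder.embed_query("What is the meaning of life?"))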
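Embeddings are plain lists of floats, so two of them can be compared with cosine similarity. The following is a small, dependency-free sketch of that comparison. The function name `cosine_similarity` is our own and is not provided by the package.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (lists of floats)."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Convention chosen for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)
```

With a valid API key, you could pass it the output of `embedder.embed_query(...)` and one element of `embedder.embed_documents([...])`, which are the standard LangChain embedding methods.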

### Truncation

Embedding and reranking models typically have a fixed context window that determines the maximum number of input tokens that can be processed. This limit can be a hard limit, equal to the model's maximum input token length, or an effective limit, beyond which the accuracy of the process decreases. Since models operate on tokens, and applications usually work with text, it can be challenging for an application to ensure that its input stays within the model's token limits. By default, an exception is thrown if the input is too large.

NVIDIA NIM microservices can truncate the input on the server side if it's too large. The `truncate` parameter accepts the following values:

- **NONE** – An exception is thrown if the input is too large. This is the default option.
- **START** – The server truncates from the start of the input, discarding tokens as necessary.
- **END** – The server truncates from the end of the input, discarding tokens as necessary.
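To make the three options concrete, here is a local illustration of the semantics described above, applied to a list of tokens. It is a sketch of our reading of the option descriptions, not the server's implementation; the function `truncate_tokens` and the `max_tokens` parameter are hypothetical.

```python
def truncate_tokens(tokens, max_tokens, truncate="NONE"):
    """Illustrate NONE / START / END truncation on a token list.

    Hypothetical helper: the real truncation happens on the NIM server.
    """
    if len(tokens) <= max_tokens:
        return list(tokens)
    if truncate == "NONE":
        raise ValueError(
            f"Input has {len(tokens)} tokens, limit is {max_tokens}"
        )
    if truncate == "START":
        # Discard tokens from the start, keeping the tail of the input.
        return list(tokens[len(tokens) - max_tokens:])
    if truncate == "END":
        # Discard tokens from the end, keeping the head of the input.
        return list(tokens[:max_tokens])
    raise ValueError(f"Unknown truncate option: {truncate!r}")
```

For a question-answering corpus, `END` is often the safer choice when the most important content appears at the beginning of each passage, which is why the later cells in this notebook pass `truncate="END"`.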

## Combining results from multiple sources

Consider a pipeline with data from a BM25 store as well as a semantic store, such as FAISS. Each store is queried independently and returns results that the individual store considers to be highly relevant. Figuring out the overall relevance of the results is where re-ranking comes into play. We will search for information about the query `What is the meaning of life?` across both a BM25 store and a semantic store.

In [7]:
query = "What is the meaning of life?"

### BM25 relevant documents

First, let's create a BM25 index that we can query. We use the [`BM25Retriever`](https://python.langchain.com/v0.2/docs/integrations/retrievers/bm25/) and web search results from [DuckDuckGo](https://duckduckgo.com/).

In [None]:
%pip install --upgrade --quiet langchain-community duckduckgo-search beautifulsoup4 rank_bm25

In [None]:
%pip install --upgrade --quiet ddgs

In [20]:
from langchain_community.utilities import DuckDuckGoSearchAPIWrapper
from langchain_community.retrievers import BM25Retriever
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
import requests
from bs4 import BeautifulSoup
from typing import List

In [21]:
def build_documents(query, search_util, text_splitter, source) -> List[Document]:
    """Fetch each search result, extract its text, and split it into Documents."""
    documents = []
    print(f"Building documents for {query}")
    for result in search_util(query):
        print(f"Processing {result['title']} - {result['link']}")
        try:
            # A timeout keeps one slow site from stalling the whole loop.
            response = requests.get(result["link"], timeout=10)
            page_text = BeautifulSoup(response.text, "html.parser").get_text()
            for chunk in text_splitter.split_text(page_text):
                documents.append(
                    Document(
                        page_content=chunk,
                        metadata={
                            "title": result["title"],
                            "url": result["link"],
                            "source": source,
                        },
                    )
                )
        except Exception as e:
            # Best effort: skip pages that fail to download or parse.
            print(f"Skipping due to error: {e}")
    print(f"Done building {len(documents)} documents")
    return documents

In [None]:
# This might take a few minutes to run.

bm25_tool = lambda query: DuckDuckGoSearchAPIWrapper().results(query, max_results=100, source="text")
bm25_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200, length_function=len)
bm25_retriever = BM25Retriever.from_documents(build_documents(query, bm25_tool, bm25_splitter, "DuckDuckGo Text"))
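The `chunk_size` and `chunk_overlap` arguments control how much text each document holds and how much neighboring chunks share. The sketch below shows the idea with a fixed-width sliding window over characters. It is a deliberate simplification: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and sentences, which this hypothetical `sliding_chunks` function does not do.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap.

    Simplified illustration only; not the algorithm used by
    RecursiveCharacterTextSplitter.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The last window already reaches the end of the text.
    return chunks
```

With `chunk_size=1000` and `chunk_overlap=200`, each window starts 800 characters after the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk as long as it is shorter than the overlap.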

Return the relevant documents from the query `"What is the meaning of life?"` with the BM25 retriever.

In [None]:
bm25_retriever.k = 500
bm25_docs = bm25_retriever.invoke(query)
len(bm25_docs), bm25_docs[:5]

### Semantic documents

Next, let's create a FAISS index that we can query. We again use web search results from [DuckDuckGo](https://duckduckgo.com/), this time from its news source.

In [None]:
# Use the following install command that matches your hardware

# %pip install --upgrade --quiet faiss-gpu
%pip install --upgrade --quiet faiss-cpu

In [None]:
from langchain_community.vectorstores import FAISS

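# The flag below is not used when building the index in this cell. It only
# matters if you later reload a saved index, for example with FAISS.load_local.
# De-serialization relies on loading a pickle file.
# Pickle files can be modified to deliver a malicious payload that results in execution of arbitrary code on your machine.
# Only perform this with a pickle file you have created and no one else has modified.
allow_dangerous_deserialization = True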

sem_tool = lambda query: DuckDuckGoSearchAPIWrapper().results(query, max_results=100, source="news")
sem_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100, length_function=len)
sem_store = FAISS.from_documents(build_documents(query, sem_tool, sem_splitter, "DuckDuckGo News"), embedding=NVIDIAEmbeddings(truncate="END"))

In [29]:
sem_retriever = sem_store.as_retriever(
    search_kwargs = {"k": 500},
)

Return the relevant documents from the query `"What is the meaning of life?"` with the FAISS semantic store.

In [None]:
sem_docs = sem_retriever.invoke(query)
len(sem_docs), sem_docs[:5]

### Combine and rank documents

Let's combine the BM25 and semantic search results. The resulting documents are ordered by their relevance to the query by the reranking NIM.

In [None]:
ranker = NVIDIARerank(truncate="END")

all_docs = bm25_docs + sem_docs

ranker.top_n = 5
docs = ranker.compress_documents(query=query, documents=all_docs)
docs
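Because both retrievers draw on web search results for the same query, the combined list may contain identical chunks. One optional refinement, not part of the original notebook, is to drop exact duplicates before sending the list to the reranker. The sketch below assumes only that each document exposes a `page_content` attribute, as LangChain `Document` objects do; the function `dedupe_by_content` is hypothetical, and `SimpleNamespace` stands in for real documents.

```python
from types import SimpleNamespace


def dedupe_by_content(docs):
    """Keep the first occurrence of each distinct page_content, preserving order."""
    seen = set()
    unique = []
    for doc in docs:
        key = doc.page_content.strip()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


# Stand-in documents for illustration; real code would pass all_docs.
sample_docs = [
    SimpleNamespace(page_content="Life is what you make it."),
    SimpleNamespace(page_content="42"),
    SimpleNamespace(page_content="Life is what you make it. "),
]
unique_docs = dedupe_by_content(sample_docs)
```

In the notebook, this would slot in as `all_docs = dedupe_by_content(bm25_docs + sem_docs)` before the call to `ranker.compress_documents`.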

## Self-host with NVIDIA NIM Microservices

When you are ready to deploy your AI application, you can self-host models with NVIDIA NIM. 
For more information, refer to [NVIDIA NIM Microservices](https://www.nvidia.com/en-us/ai-data-science/products/nim-microservices/).

The following code connects to locally hosted NIM Microservices.

In [None]:
from langchain_nvidia_ai_endpoints import ChatNVIDIA, NVIDIAEmbeddings, NVIDIARerank

# connect to a chat NIM running at localhost:8000, specifying a model
llm = ChatNVIDIA(base_url="http://localhost:8000/v1", model="meta/llama3-8b-instruct")

# connect to an embedding NIM running at localhost:8080
embedder = NVIDIAEmbeddings(base_url="http://localhost:8080/v1")

# connect to a reranking NIM running at localhost:2016
ranker = NVIDIARerank(base_url="http://localhost:2016/v1")
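If you want the same notebook to run against either the hosted API Catalog or a local NIM, one possible approach is to read the base URL from an environment variable and only pass `base_url` when it is set. This is a sketch under our own assumptions: the helper `nim_client_kwargs` and the variable name `EMBEDDING_NIM_URL` are hypothetical and are not defined by the package.

```python
import os


def nim_client_kwargs(env_var, env=None):
    """Return {'base_url': ...} if env_var is set, otherwise an empty dict.

    Hypothetical helper: an empty dict means the client is created
    without base_url, as in the earlier hosted-endpoint cells.
    """
    env = os.environ if env is None else env
    base_url = env.get(env_var, "").strip()
    return {"base_url": base_url} if base_url else {}
```

For example, `NVIDIAEmbeddings(**nim_client_kwargs("EMBEDDING_NIM_URL"))` would connect to the local NIM when the variable is set and fall back to the hosted endpoint otherwise.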

## Related Topics

- [langchain-nvidia-ai-endpoints package ReadMe](https://github.com/langchain-ai/langchain-nvidia/blob/main/libs/ai-endpoints/README.md)
- [Overview of NVIDIA NIM for Large Language Models (LLMs)](https://docs.nvidia.com/nim/large-language-models/latest/introduction.html)
- [Overview of NeMo Retriever Embedding NIM](https://docs.nvidia.com/nim/nemo-retriever/text-embedding/latest/overview.html)
- [Overview of NeMo Retriever Reranking NIM](https://docs.nvidia.com/nim/nemo-retriever/text-reranking/latest/overview.html)