# NVIDIA AI Foundation Endpoints 

The `langchain-nvidia-ai-endpoints` package contains LangChain integrations for chat models and embeddings powered by [NVIDIA AI Foundation Models](https://www.nvidia.com/en-us/ai-data-science/foundation-models/) and hosted on the [NVIDIA API Catalog](https://build.nvidia.com/).

NVIDIA AI Foundation models are community- and NVIDIA-built models that NVIDIA has optimized for performance on NVIDIA accelerated infrastructure. Using the API, you can query live endpoints on the NVIDIA API Catalog to get quick results from a DGX-hosted cloud compute environment. All models are source-accessible and can be deployed on your own compute cluster using NVIDIA NIM, which is part of NVIDIA AI Enterprise.

Models can be exported from NVIDIA's API Catalog with NVIDIA NIM, which is included with the NVIDIA AI Enterprise license, and run on-premises, giving enterprises ownership of their customizations and full control of their IP and AI applications. NIMs are packaged per model or model family and are distributed as container images through the NVIDIA NGC Catalog. At their core, NIMs are containers that provide interactive APIs for running inference on an AI model.

This example goes over how to use LangChain to interact with the supported [NVIDIA Retrieval QA Embedding Model](https://build.nvidia.com/nvidia/embed-qa-4) for [retrieval-augmented generation](https://developer.nvidia.com/blog/build-enterprise-retrieval-augmented-generation-apps-with-nvidia-retrieval-qa-embedding-model/) via the `NVIDIAEmbeddings` class.

For more information on accessing the chat models through this API, check out the [ChatNVIDIA](https://python.langchain.com/docs/integrations/chat/nvidia_ai_endpoints/) documentation.

# NVIDIA NeMo Retriever Reranking

Reranking is a critical piece of high-accuracy, efficient retrieval pipelines.

Two important use cases:
- Combining results from multiple data sources
- Enhancing accuracy for single data sources

## Installation

In [None]:
%pip install --upgrade --quiet  langchain-nvidia-ai-endpoints

## Setup

**To get started:**

1. Create a free account with [NVIDIA](https://build.nvidia.com/), which hosts NVIDIA AI Foundation models.

2. Select the `Retrieval` tab, then select your model of choice.

3. Under `Input` select the `Python` tab, and click `Get API Key`. Then click `Generate Key`.

4. Copy and save the generated key as `NVIDIA_API_KEY`. From there, you should have access to the endpoints.

In [None]:
import getpass
import os

# del os.environ['NVIDIA_API_KEY']  ## delete key and reset
if os.environ.get("NVIDIA_API_KEY", "").startswith("nvapi-"):
    print("Valid NVIDIA_API_KEY already in environment. Delete to reset")
else:
    nvapi_key = getpass.getpass("NVAPI Key (starts with nvapi-): ")
    assert nvapi_key.startswith("nvapi-"), f"{nvapi_key[:5]}... is not a valid key"
    os.environ["NVIDIA_API_KEY"] = nvapi_key

## Working with NVIDIA NIMs

When ready to deploy, you can self-host models with NVIDIA NIM—which is included with the NVIDIA AI Enterprise software license—and run them anywhere, giving you ownership of your customizations and full control of your intellectual property (IP) and AI applications.

[Learn more about NIMs](https://developer.nvidia.com/blog/nvidia-nim-offers-optimized-inference-microservices-for-deploying-ai-models-at-scale/)



See [how to download and launch a NIM in your environment]().

In [None]:
from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings, NVIDIARerank

# connect to an embedding NIM running at localhost:2016
embedder = NVIDIAEmbeddings(base_url="http://localhost:2016/v1")

# connect to a reranking NIM running at localhost:1976
ranker = NVIDIARerank(base_url="http://localhost:1976/v1")

### Combining results from multiple sources

Consider a pipeline with data from a semantic store, such as FAISS, as well as a BM25 store.

Each store is queried independently and returns results that the individual store considers to be highly relevant. Figuring out the overall relevance of the results is where reranking comes into play.
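
To make the problem concrete, here is a minimal sketch of why store-native scores cannot simply be merged. Everything in it is illustrative: the passages and scores are made up, and `toy_relevance` is a hypothetical token-overlap scorer standing in for a real reranking model. It only shows the idea of scoring every candidate against the query on one shared scale.

```python
import re


def tokenize(text: str) -> set:
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def toy_relevance(query: str, passage: str) -> float:
    """Hypothetical stand-in for a reranking model: Jaccard token overlap."""
    q, p = tokenize(query), tokenize(passage)
    union = q | p
    return len(q & p) / len(union) if union else 0.0


toy_query = "What is the meaning of life?"

# (passage, store-native score). BM25 scores are unbounded, while cosine
# similarities are at most 1, so the raw numbers are not comparable.
bm25_hits = [
    ("the meaning of life is a classic philosophy question", 12.7),
    ("life insurance policy terms", 9.3),
]
semantic_hits = [
    ("philosophers debate what the meaning of life is", 0.83),
    ("a guide to houseplants", 0.41),
]

# Naive merge: sort by raw scores, which always favors the BM25 hits.
naive = [p for p, _ in sorted(bm25_hits + semantic_hits, key=lambda h: h[1], reverse=True)]

# Reranking: score every candidate against the query with one scorer.
candidates = [p for p, _ in bm25_hits + semantic_hits]
reranked = sorted(candidates, key=lambda p: toy_relevance(toy_query, p), reverse=True)

print(naive)
print(reranked)
```

In the naive merge, the off-topic insurance passage outranks the relevant semantic hit purely because BM25 scores are numerically larger; scoring all candidates with a single function removes that bias.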

We will search for information about the query `What is the meaning of life?` across a BM25 store and semantic store.

In [None]:
query = "What is the meaning of life?"

#### BM25 relevant documents

Below we assume you have Elasticsearch running with documents stored in a `langchain-index` store.

In [None]:
%pip install --upgrade --quiet langchain-community elasticsearch

In [None]:
import elasticsearch
from langchain_community.retrievers import ElasticSearchBM25Retriever

bm25_retriever = ElasticSearchBM25Retriever(
    client=elasticsearch.Elasticsearch("http://localhost:9200"),
    index_name="langchain-index"
)

In [None]:
bm25_docs = bm25_retriever.invoke(query)

#### Semantic documents

Below we assume you have a saved FAISS index.

In [None]:
%pip install --upgrade --quiet langchain-community langchain-nvidia-ai-endpoints faiss-gpu

In [None]:
from langchain_community.vectorstores import FAISS
from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings

embedder = NVIDIAEmbeddings()

# De-serialization relies on loading a pickle file.
# Pickle files can be modified to deliver a malicious payload that
# results in execution of arbitrary code on your machine.
# Only perform this with a pickle file you have created and no one
# else has modified.
allow_dangerous_deserialization = True

# Pass the embedder defined above so queries are embedded the same way as the index.
sem_retriever = FAISS.load_local(
    "langchain_index",
    embeddings=embedder,
    allow_dangerous_deserialization=allow_dangerous_deserialization,
).as_retriever()

In [None]:
sem_docs = sem_retriever.invoke(query)

#### Combine and rank documents

The resulting `docs` will be ordered by their relevance to the query.
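
If the two stores index overlapping content, concatenating their results can hand the reranker the same passage twice. The following is a minimal sketch of one way to drop duplicates first. `SimpleDoc` and `dedupe_by_content` are hypothetical names introduced for illustration and are not part of LangChain; the sketch assumes that two documents with identical `page_content` are duplicates.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDoc:
    """Hypothetical stand-in for a LangChain Document."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def dedupe_by_content(docs):
    """Keep the first occurrence of each distinct page_content, preserving order."""
    seen = set()
    unique = []
    for doc in docs:
        key = doc.page_content.strip()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


# Made-up results in which "Passage B" is returned by both stores.
bm25_sample = [
    SimpleDoc("Passage A", {"source": "bm25"}),
    SimpleDoc("Passage B", {"source": "bm25"}),
]
sem_sample = [
    SimpleDoc("Passage B", {"source": "faiss"}),
    SimpleDoc("Passage C", {"source": "faiss"}),
]

unique_docs = dedupe_by_content(bm25_sample + sem_sample)
print([d.page_content for d in unique_docs])
```

Because the helper only reads the `page_content` attribute, it could be applied to the combined list of retrieved documents before it is passed to the reranker.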

In [None]:
from langchain_nvidia_ai_endpoints import NVIDIARerank

ranker = NVIDIARerank()

all_docs = bm25_docs + sem_docs

docs = ranker.compress_documents(query=query, documents=all_docs)

### Enhancing accuracy for single data sources

Semantic search with vector embeddings efficiently reduces a large corpus to a smaller set of candidate documents, but it trades some accuracy for that efficiency. Ranking the full corpus with a reranker is typically too slow for applications, so the reranker instead post-processes the smaller candidate set, adding accuracy back into the search.
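
The retrieve-then-rerank pattern can be sketched with toy vectors. This is an illustration under stated assumptions, not how the NVIDIA models work: `coarse_score` (cosine similarity on only the first two dimensions) is a hypothetical stand-in for a fast, lossy first-stage search, and full cosine similarity stands in for the slower, more accurate reranker. The vectors and document names are made up.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def coarse_score(q, v):
    """Cheap, lossy first-stage score: looks at the first two dimensions only."""
    return cosine(q[:2], v[:2])


corpus = {
    "doc_a": [1.0, 0.0, 0.0, 1.0],
    "doc_b": [0.9, 0.1, 0.9, 0.0],
    "doc_c": [0.0, 1.0, 0.0, 0.0],
    "doc_d": [0.8, 0.2, 0.1, 0.9],
}
query_vec = [1.0, 0.0, 1.0, 0.0]

# Stage 1: narrow the corpus to k candidates with the cheap score.
k = 3
candidates = sorted(corpus, key=lambda name: coarse_score(query_vec, corpus[name]), reverse=True)[:k]

# Stage 2: reorder only the candidates with the accurate score and keep top_n.
top_n = 2
final = sorted(candidates, key=lambda name: cosine(query_vec, corpus[name]), reverse=True)[:top_n]

print(candidates)
print(final)
```

The first stage ranks `doc_a` highest because it ignores half of each vector; the second stage corrects the order while scoring only `k` documents rather than the whole corpus.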

In [None]:
%pip install --upgrade --quiet langchain langchain-nvidia-ai-endpoints pgvector psycopg langchain-postgres

Below we assume you have PostgreSQL running with documents stored in a collection named `langchain-index`.

We will narrow the collection to 1,000 results and further narrow it to 10 with the reranker.

In [None]:
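from langchain_nvidia_ai_endpoints import NVIDIAEmbeddings, NVIDIARerank

# The `embeddings=` and `connection=` arguments below belong to the
# langchain-postgres PGVector class installed above.
from langchain_postgres import PGVector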

ranker = NVIDIARerank(top_n=10)
embedder = NVIDIAEmbeddings()

store = PGVector(embeddings=embedder,
                 collection_name="langchain-index",
                 connection="postgresql+psycopg://langchain:langchain@localhost:6024/langchain")

subset_docs = store.similarity_search(query, k=1_000)

docs = ranker.compress_documents(query=query, documents=subset_docs)