# NVIDIA Guardrails with RAG for fact-checking

In this notebook, we'll demonstrate how to integrate a RAG (Retrieval-Augmented Generation) system built on NVIDIA AI Endpoints with NeMo Guardrails for grounded text generation and fact-checking. While there are many examples of integrating NeMo Guardrails and RAG pipelines using LangChain, this demo shows how Guardrails can be combined with any RAG system, independent of the `RunnableRails` interface.

The workflow is as follows. First, a [vector store](config/config.py#L155) is built using FAISS, with an NVIDIA NIM serving as the [embedding model](./config/config.yml#L13). The content for the vector store is sourced from the NVIDIA Triton documentation pages, though the code can easily be modified to use any other data source or vector store. Next, NeMo Guardrails is configured to use the vector store both to generate bot messages and to fact-check them.
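
To make the retrieval step concrete, here is a minimal, dependency-free sketch of what a vector store does: embed each chunk once, then rank chunks by cosine similarity to the embedded query. It assumes a toy bag-of-words `embed` function in place of the NVIDIA NIM embedding model and a brute-force `ToyVectorStore` in place of FAISS; both names are hypothetical, and the sample chunks are illustrative text, not the indexed Triton pages.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Dict[str, float]:
    # Hypothetical stand-in for the NIM embedding model: bag-of-words term counts.
    counts = Counter(re.findall(r"[a-z0-9-]+", text.lower()))
    return {word: float(n) for word, n in counts.items()}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Brute-force in-memory index; FAISS fills this role (with efficient search) in the notebook."""

    def __init__(self, chunks: List[str]) -> None:
        # Embed every chunk once, at build time.
        self.entries = [(chunk, embed(chunk)) for chunk in chunks]

    def search(self, query: str, k: int = 2) -> List[Tuple[str, float]]:
        # Embed the query and return the k most similar chunks with their scores.
        q = embed(query)
        scored = [(chunk, cosine(q, vec)) for chunk, vec in self.entries]
        scored.sort(key=lambda item: item[1], reverse=True)
        return scored[:k]


# Illustrative chunks paraphrasing Triton documentation topics.
chunks = [
    "Start Triton with --model-repository to point the server at a model repository.",
    "Model Management lets you load and unload models while Triton is running.",
    "The dynamic batcher combines inference requests into larger batches.",
]
store = ToyVectorStore(chunks)
for chunk, score in store.search("where is the model repository location in triton", k=1):
    print(f"{score:.2f}  {chunk}")
```

A real embedding model captures meaning rather than shared words, but the build-once, rank-per-query structure is the same.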

Let's start by installing the required libraries and importing the necessary packages for this notebook.

```python
!pip install faiss-gpu==1.7.2  # replace with faiss-cpu if you don't have a GPU
!pip install langchain==0.2.15
!pip install nemoguardrails==0.9.1.1
!pip install langchain-nvidia-ai-endpoints==0.1.2
!pip install bs4
```

```python
import getpass
import os

# Prompt for the key only if a valid-looking NVIDIA API key isn't already set.
if not os.environ.get("NVIDIA_API_KEY", "").startswith("nvapi-"):
    nvidia_api_key = getpass.getpass("Enter your NVIDIA API key: ")
    assert nvidia_api_key.startswith(
        "nvapi-"
    ), f"{nvidia_api_key[:5]}... is not a valid key"
    os.environ["NVIDIA_API_KEY"] = nvidia_api_key
```

As stated in the [documentation](https://docs.nvidia.com/nemo/guardrails/introduction.html#guardrails-configuration), the `config.py` file holds custom initialization code, which in this notebook initializes the vector store. Specifically, the [`init` function](./config/config.py#L197) is invoked by the `LLMRails` constructor and serves as the entry point for building the vector store. It also instructs Guardrails to use the knowledge base as an additional data source when generating bot messages.
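
The `init` hook pattern can be sketched without NeMo Guardrails installed. `ToyRails` below is a hypothetical stand-in that only mimics two `LLMRails` hooks, `register_action` and `register_action_param`, and calls `init` from its constructor; the document list and the trivial retriever are placeholders for the FAISS store that the notebook's `config.py` builds.

```python
from typing import Any, Callable, Dict, List, Optional


class ToyRails:
    """Hypothetical stand-in for LLMRails that models only the init()/registration hooks."""

    def __init__(self, init_fn: Optional[Callable[["ToyRails"], None]] = None) -> None:
        self.actions: Dict[str, Callable[..., Any]] = {}
        self.action_params: Dict[str, Any] = {}
        if init_fn is not None:
            # Mirrors the LLMRails constructor invoking init(app) from config.py.
            init_fn(self)

    def register_action(self, action: Callable[..., Any], name: str) -> None:
        self.actions[name] = action

    def register_action_param(self, name: str, value: Any) -> None:
        self.action_params[name] = value


def retrieve_relevant_chunks(
    context: dict, retriever: Optional[Callable[[str], List[str]]] = None
) -> List[str]:
    # Delegates to whichever retriever was registered during init().
    return retriever(context["last_user_message"])


def init(app: ToyRails) -> None:
    # Placeholder knowledge base; the notebook builds a FAISS store here instead.
    docs = ["Start Triton with --model-repository to set the model repository."]

    def retriever(query: str) -> List[str]:
        # Trivial retriever: returns every chunk regardless of the query.
        return list(docs)

    app.register_action(retrieve_relevant_chunks, "retrieve_relevant_chunks")
    app.register_action_param("retriever", retriever)


toy_rails = ToyRails(init)  # named toy_rails so it doesn't shadow the notebook's `rails`
print(sorted(toy_rails.actions), sorted(toy_rails.action_params))
```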

By providing a custom implementation of the [`retrieve_relevant_chunks` function](./config/config.py#L171), we enable NeMo Guardrails to retrieve external knowledge from the vector store and inject it into prompts. For example, the [`generate_bot_message` prompt](./config/prompts.yml#L82) includes the `relevant_chunks` variable, which is filled in when the template is rendered. Note that the `retriever` argument of `retrieve_relevant_chunks` is passed automatically by NeMo Guardrails, because it is registered as an [action parameter](./config/config.py#L211).
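
The sketch below illustrates, under simplifying assumptions, how a registered parameter can reach an action and how the action's output ends up in a prompt. `call_action` is a hypothetical dispatcher that passes only the registered parameters whose names appear in the action's signature, which is why the argument name `retriever` must match the registered name. `toy_retriever` and `PROMPT` are placeholders: the real `generate_bot_message` prompt is a Jinja2 template in `prompts.yml`, not a `str.format` string.

```python
import inspect
from typing import Any, Callable, Dict, List


def call_action(action: Callable[..., Any], context: dict, action_params: Dict[str, Any]) -> Any:
    """Hypothetical dispatcher: inject each registered param whose name appears in the signature."""
    accepted = inspect.signature(action).parameters
    kwargs = {name: value for name, value in action_params.items() if name in accepted}
    if "context" in accepted:
        kwargs["context"] = context
    return action(**kwargs)


def retrieve_relevant_chunks(context: dict, retriever=None) -> Dict[str, str]:
    chunks: List[str] = retriever(context["last_user_message"])
    # Returned as a context update so the prompt can read `relevant_chunks`.
    return {"relevant_chunks": "\n".join(chunks)}


def toy_retriever(query: str) -> List[str]:
    # Hypothetical: a real retriever would run a similarity search over the vector store.
    return ["Start Triton with --model-repository=<path> to set the model repository."]


# Simplified stand-in for the generate_bot_message prompt.
PROMPT = "Relevant context:\n{relevant_chunks}\n\nUser: {user_message}\nBot:"

# The second parameter is not in the action's signature, so it is not injected.
registered_params = {"retriever": toy_retriever, "some_other_param": 42}
context = {"last_user_message": "How can I specify the location of the model repository?"}
context.update(call_action(retrieve_relevant_chunks, context, registered_params))
prompt = PROMPT.format(
    relevant_chunks=context["relevant_chunks"],
    user_message=context["last_user_message"],
)
print(prompt)
```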

Finally, to activate fact-checking, the `self check facts` flow must be added as an output rail in the standard [config.yml](config/config.yml#L21) file required by NeMo Guardrails.

```python
from nemoguardrails import LLMRails, RailsConfig

config = RailsConfig.from_path("guardrails/rag-fact-check/config")
# The LLMRails constructor calls init() from config.py, which builds the vector store.
rails = LLMRails(config, verbose=True)
```

```
Storing embeddings to guardrails/rag-fact-check/config/../data/nv_embedding
Generated embedding successfully
```
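```python
import nest_asyncio

# Jupyter already runs an event loop; this lets the synchronous rails.generate()
# run its async internals inside that loop.
nest_asyncio.apply()
```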

```python
outputs = rails.generate(
    messages=[
        {
            "role": "user",
            "content": "how can I specify the location of the model repository in triton-inference-server",
        }
    ],
    # Also return context variables, LLM output, and detailed generation logs.
    options={
        "output_vars": True,
        "llm_output": True,
        "log": {
            "activated_rails": True,
            "llm_calls": True,
            "internal_events": True,
            "colang_history": True,
        },
    },
)
print(outputs.response[0]["content"])
```

```
You can specify the location of the model repository in Triton Inference Server using the --model-repository option when starting the server. This option allows you to specify one or more model repositories that the server will serve models from. You can also modify the models being served while the server is running using Model Management. For more information, please refer to the Triton Inference Server documentation, specifically the Architecture section: https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/architecture.html
```


As outlined in the main documentation, the purpose of the [fact-checking rails](https://docs.nvidia.com/nemo/guardrails/user_guides/guardrails-library.html#fact-checking) is to verify that the messages generated by the bot are grounded in the knowledge base. If a bot message contains information, correct or not, that is not supported by the knowledge base, the `self check facts` rail prevents it from being shown to the user.
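
As a rough illustration of this blocking behavior, the sketch below implements an output rail that replaces an ungrounded answer with the refusal message seen at the end of this notebook. The grounding test is a deliberate simplification: it measures word overlap between the response and the evidence using a hypothetical stopword list and threshold, whereas the actual `self check facts` rail asks the LLM, via the `self_check_facts` prompt, whether the response is entailed by the retrieved `relevant_chunks`.

```python
import re
from typing import Set

# Hypothetical stopword list; the real rail uses no such heuristic and asks an LLM instead.
STOPWORDS = {"a", "an", "and", "in", "is", "it", "of", "or", "the", "to", "when", "with", "you"}
REFUSAL = "I'm sorry, I can't respond to that."


def content_words(text: str) -> Set[str]:
    return {w for w in re.findall(r"[a-z0-9-]+", text.lower()) if w not in STOPWORDS}


def toy_self_check_facts(evidence: str, response: str, threshold: float = 0.6) -> bool:
    """Crude grounding test: share of the response's content words that also occur in the evidence."""
    words = content_words(response)
    if not words:
        return True
    supported = len(words & content_words(evidence)) / len(words)
    return supported >= threshold


def output_rail(evidence: str, response: str) -> str:
    # Pass the bot message through if grounded, otherwise replace it with a refusal.
    return response if toy_self_check_facts(evidence, response) else REFUSAL


evidence = "Start Triton with --model-repository to set the location of the model repository."
grounded = "Start Triton with the --model-repository option to set the model repository location."
ungrounded = "Write the kernel in a .cu file and compile it with nvcc."
print(output_rail(evidence, grounded))    # printed unchanged
print(output_rail(evidence, ungrounded))  # replaced by REFUSAL
```

With Triton-only evidence, the CUDA-style answer is refused regardless of whether it is correct, which is the same effect shown in the second example below.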

In the following example, the bot's answer is blocked even though it was correct, because the knowledge base only contains information about NVIDIA Triton while the question is about CUDA.

```python
outputs = rails.generate(
    messages=[
        {
            "role": "user",
            "content": "how do I build custom cuda kernel ?",
        }
    ],
    options={
        "output_vars": True,
        "llm_output": True,
        "log": {
            "activated_rails": True,
            "llm_calls": True,
            "internal_events": True,
            "colang_history": True,
        },
    },
)
print(outputs.response[0]["content"])
```

```
I'm sorry, I can't respond to that.
```
