<center><a href="https://www.nvidia.com/en-us/training/"><img src="https://dli-lms.s3.amazonaws.com/assets/general/DLI_Header_White.png" width="400" height="186" /></a></center>
<br>

# <font color="#76b900">**Notebook 3:** Retriever and RAG Evaluation </font>

The rapid advancement of Large Language Models (LLMs) has led to a surge in their application to various natural language processing tasks, including information retrieval and question answering. However, the complexity and opacity of these models make it challenging to assess their performance and identify areas for improvement. 
<br />
<br />
Evaluating the retrieval and reranking capabilities of LLMs is crucial to ensure that they provide accurate and relevant results, particularly in high-stakes applications such as search engines, virtual assistants, and decision-support systems. By assessing the effectiveness of LLMs in retrieving and reranking relevant information, researchers and developers can identify potential biases, errors, and limitations, and develop more robust and reliable models that better serve the needs of users. Furthermore, evaluating retrieval and reranking LLMs can also inform the development of more efficient and effective training methods, leading to improved performance and faster convergence, which is essential for realizing the full potential of these powerful models.

In the following notebook, we'll be exploring how to use [NeMo Evaluator microservice](https://developer.nvidia.com/docs/nemo-microservices/evaluation/source/overview.html) to evaluate [Retriever Models](https://developer.nvidia.com/docs/nemo-microservices/evaluation/source/models/models_retriever.html) as well as [Retrieval Augmented Generation (RAG) Models](https://developer.nvidia.com/docs/nemo-microservices/evaluation/source/models/models_rag.html)!

We'll look at the following examples: 

- Retriever Model Evaluation on FiQA
- Retriever + Reranking Evaluation on FiQA
- Bonus - Retrieval Augmented Generation (RAG) Evaluation on FiQA with Ragas Metrics (HW)

## Initial Set-up and Notebook Dependencies

In order to run this notebook, the following will need to be up and running: 

- Evaluator Microservice, which can be conveniently deployed through the [Deploying with Helm](https://developer.nvidia.com/docs/nemo-microservices/evaluation/source/deploy-helm.html) guide
- NVIDIA NIM Text Embedding or a hosted Text Embedding model, `nvidia/llama-3.2-nv-embedqa-1b-v2`, which can be deployed using this [Getting Started](https://docs.nvidia.com/nim/nemo-retriever/text-embedding/latest/getting-started.html) guide
- NVIDIA NIM Text Reranking or a hosted Text Reranking model, `nvidia/llama-3.2-nv-rerankqa-1b-v2`, which can be deployed using this [Getting Started](https://docs.nvidia.com/nim/nemo-retriever/text-reranking/latest/getting-started.html) guide
- NVIDIA NIM for LLM, `meta/llama-3.1-8b-instruct` (the same one we used in Notebook 2), which can be deployed using this [Getting Started](https://docs.nvidia.com/nim/large-language-models/latest/getting-started.html) guide

Once all of our services are up and running, we can install the Python `requests` library, which we will use to communicate with the Evaluator API.

In [32]:
!pip install -qU requests huggingface_hub==0.26.2



We'll need to provide the Evaluation API URL in the cell below.

> NOTE: Your evaluation URL will be provided as part of your deployment. 

In [33]:
EVAL_URL = "http://nemo-evaluator.local"

We'll also need to provide the endpoint URLs and model names for our models, which are set up as part of the deployment process for each NIM.

Below is an example of the default value for the embedding NIM:

- Embedding: 
  - EMBEDDING_URL: `http://localhost:8000/v1/embeddings`
  - EMBEDDING_MODEL_NAME: `nvidia/nv-embedqa-e5-v5`

In [34]:
!kubectl -n llama3-1-8b-instruct get svc

NAME                        TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
meta-llama3-1-8b-instruct   ClusterIP   10.102.195.57   <none>        8000/TCP   3d4h


In [35]:
# LLM
from pprint import pp

NIM_IP = "10.102.195.57"  # FIXME: replace with the CLUSTER-IP reported above
LLM_URL = f"http://{NIM_IP}:8000/v1/completions"
LLM_MODEL_NAME = "meta/llama-3.1-8b-instruct"
pp(LLM_URL)

'http://10.102.195.57:8000/v1/completions'


In [2]:
# embedding
EMBEDDING_URL = "https://integrate.api.nvidia.com/v1/embeddings"
EMBEDDING_MODEL_NAME = "nvidia/llama-3.2-nv-embedqa-1b-v2"

# reranker 
RERANKER_URL = "https://ai.api.nvidia.com/v1/retrieval/nvidia/llama-3_2-nv-rerankqa-1b-v2/reranking"
RERANKER_MODEL_NAME = "nvidia/llama-3.2-nv-rerankqa-1b-v2"

Now we can verify our Evaluation API is up and running with the built-in health check!

In [37]:
import requests
import urllib3
from pprint import pp

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

resp = requests.get(f"{EVAL_URL}/health", verify=False)
pp(resp.status_code)

200


In [38]:
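import os

# Read the key from the environment instead of hardcoding it in the notebook.
# Assumes NGC_API_KEY was exported before the notebook server was started.
NGC_API_KEY = os.environ["NGC_API_KEY"]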

## Retriever Model Evaluation on FiQA

For our first evaluation, we're going to evaluate our Retrieval Model (`nvidia/llama-3.2-nv-embedqa-1b-v2`) on the [FiQA](https://sites.google.com/view/fiqa/) retrieval task as part of the [BeIR](https://github.com/beir-cellar/beir) benchmark.

The core pieces we need to provide are: 

- `top_k`, the number of documents to retrieve with our retriever model.
- `query_embedding_model`, whose `api_endpoint` holds the `url` of your hosted `nvidia/llama-3.2-nv-embedqa-1b-v2` endpoint, the `model_id` (this will be `nvidia/llama-3.2-nv-embedqa-1b-v2` if you're following the notebook exactly), and the `api_key`.
- `index_embedding_model`, which mirrors `query_embedding_model`, assuming that you're using the same NIM deployment for both Query Embedding and Index Embedding.

> NOTE: While it's possible to use different NIM *deployments* for Query/Index Embedding - you will need to ensure the underlying model is the same between both.

We'll also want to ensure we've set-up our evaluations correctly by following the available [documentation](https://developer.nvidia.com/docs/nemo-microservices/evaluation/source/evaluations/evaluations_retriever.html) for Retriever evaluations.


We can set up the Evaluator API endpoints:

In [25]:
target_endpoint = f"{EVAL_URL}/v1/evaluation/targets"
eval_config_endpoint = f"{EVAL_URL}/v1/evaluation/configs"
job_endpoint = f"{EVAL_URL}/v1/evaluation/jobs"

In [26]:
retriever_target_config = {
 "type": "retriever",
 "retriever": {
   "pipeline": {
     "query_embedding_model": {
       "api_endpoint": {
           "url": EMBEDDING_URL,
           "model_id": EMBEDDING_MODEL_NAME,
           "api_key": NGC_API_KEY
       }
     },
     "index_embedding_model": {
       "api_endpoint": {
           "url": EMBEDDING_URL,
           "model_id": EMBEDDING_MODEL_NAME,
           "api_key": NGC_API_KEY
       }
     },
     "top_k": 5
   }
 }
}
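Because the query and index blocks must name the same underlying model, one way to keep them in sync is to generate the payload from a single set of arguments. The helper below is a minimal sketch: `build_retriever_target` is a hypothetical name, not part of the Evaluator API, and it only reproduces the dictionary layout used in this notebook (including the optional `reranker_model` block that appears in the reranking section). The key passed in the example is a placeholder.

```python
from typing import Optional


def build_retriever_target(
    embedding_url: str,
    embedding_model: str,
    api_key: str,
    top_k: int = 5,
    reranker_url: Optional[str] = None,
    reranker_model: Optional[str] = None,
) -> dict:
    """Build a retriever target payload shaped like the one in this notebook."""

    def endpoint(url: str, model_id: str) -> dict:
        return {"api_endpoint": {"url": url, "model_id": model_id, "api_key": api_key}}

    pipeline = {
        # The same endpoint is used twice so query and index embeddings always match.
        "query_embedding_model": endpoint(embedding_url, embedding_model),
        "index_embedding_model": endpoint(embedding_url, embedding_model),
        "top_k": top_k,
    }
    # The reranker block is only added when both reranker arguments are given.
    if reranker_url and reranker_model:
        pipeline["reranker_model"] = endpoint(reranker_url, reranker_model)
    return {"type": "retriever", "retriever": {"pipeline": pipeline}}


example_target = build_retriever_target(
    "https://integrate.api.nvidia.com/v1/embeddings",
    "nvidia/llama-3.2-nv-embedqa-1b-v2",
    api_key="placeholder-key",
)
```

In the notebook, the same call could take `EMBEDDING_URL`, `EMBEDDING_MODEL_NAME`, and `NGC_API_KEY` instead of literals.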

First, we can take the embedding model for a quick test drive.

In [3]:
import requests
import json

headers = {
    'Authorization': f'Bearer {NGC_API_KEY}',
    'Accept': 'application/json',
    'Content-Type': 'application/json'
}

# Data payload as a Python dictionary
data = {
    "input": ["Hello NGC 2025"],
    "model": EMBEDDING_MODEL_NAME,
    "input_type": "query"
}

# Making the POST request to the EMBEDDING_URL defined earlier.
# Do not reassign EMBEDDING_URL here: the target configuration reuses it.
response = requests.post(EMBEDDING_URL, headers=headers, json=data)

# Printing the response
print(response.text)

Then we are clear to fire off the request!

In [21]:
retriever_response = requests.post(
    target_endpoint,
    json=retriever_target_config,
    headers={'accept': 'application/json'},
    verify=False)

retriever_target_name = retriever_response.json()["name"]
print(retriever_target_name)

eval-target-L5AUA4EGoN8AKtMjS7yfMp


We'll capture our target ID for the coming steps - but with this step we have created our target and are ready to create an evaluation configuration!

In [187]:
retriever_target_namespace = retriever_response.json()["namespace"]
print(f"Target Name: {retriever_target_name}, Target Namespace: {retriever_target_namespace}")

Target Name: eval-target-LyUYbe3QD3r1Yn6MBgZYmT, Target Namespace: -


Now we can grab our evaluation configuration.

In [22]:
retriever_eval_config = {
 "type": "retriever",
 "tasks": [
   {
     "type": "beir",
     "dataset": {
       "format": "beir",
       "files_url": "fiqa"
     },
     "metrics": [
       {
         "name": "recall_5",
       },
       {
         "name": "ndcg_cut_5",
       },
       {
         "name": "recall_10",
       },
       {
         "name": "ndcg_cut_10",
       }
     ]
   }
 ]
}
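The metric names follow a `<metric>_<k>` pattern: `recall_5` is recall over the top 5 retrieved documents, and `ndcg_cut_5` is nDCG truncated at rank 5. The toy calculation below is a sketch of the conventional definitions of those two metrics, using made-up document IDs and relevance judgments. It is meant for intuition only; the Evaluator's own implementation is not shown in this notebook and may differ in details.

```python
import math


def recall_at_k(ranked_ids, relevance, k):
    """Fraction of all relevant documents that appear in the top k."""
    relevant = {doc for doc, rel in relevance.items() if rel > 0}
    if not relevant:
        return 0.0
    hits = sum(1 for doc in ranked_ids[:k] if doc in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked_ids, relevance, k):
    """DCG of the top k divided by the DCG of an ideal ordering."""

    def dcg(gains):
        # Position 0 is rank 1, so the discount is log2(rank + 1) = log2(position + 2).
        return sum(g / math.log2(pos + 2) for pos, g in enumerate(gains))

    actual = [relevance.get(doc, 0) for doc in ranked_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(actual) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical ranking and relevance judgments (qrels) for a single query.
ranked = ["d3", "d1", "d7", "d2", "d9"]
qrels = {"d1": 1, "d2": 1, "d4": 1}

print(f"recall@5 = {recall_at_k(ranked, qrels, 5):.3f}")
print(f"ndcg@5   = {ndcg_at_k(ranked, qrels, 5):.3f}")
```

Here two of the three relevant documents are retrieved, so recall is 2/3, while nDCG is lower because neither relevant document sits at rank 1.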

Now that we have our payload, we can send it to our NeMo Evaluator endpoint.

In [23]:
retriever_eval_response = requests.post(
    eval_config_endpoint,
    json=retriever_eval_config,
    headers={'accept': 'application/json'},
    verify=False)

retriever_config_name = retriever_eval_response.json()["name"]
print(retriever_config_name)

eval-config-CbrMyHq1VAwcv3MAykACAS


Let's again capture our evaluation config for use later.

In [24]:
retriever_config_namespace = retriever_eval_response.json()["namespace"]
print(f"Config Name: {retriever_config_name}, Config Namespace: {retriever_config_namespace}")

Config Name: eval-config-CbrMyHq1VAwcv3MAykACAS, Config Namespace: default


### Running an Evaluation Job

Now that we have our `target_id` and `config_id` -  we have everything we need to run an evaluation.

Let's see the process to create and run a job! 

First things first, we need to create a job payload to send to our endpoint - this will point to our target, and our configuration.

In [28]:
job_config = {
    "target": f"default/{retriever_target_name}",
    "config": f"default/{retriever_config_name}",
    "tags": [
        "embedding-fiqa"
    ]
}

All that's left to do is fire off our job!

In [29]:
retriever_job_response = requests.post(
    job_endpoint,
    json=job_config,
    headers={'accept': 'application/json'},
    verify=False)

retriever_job_id = retriever_job_response.json()["id"]
print(f"Job ID: {retriever_job_id}")

Job ID: eval-JaSyG6LemDDz4PmUHPY56A


#### Monitoring

We can monitor the status of our job through the following endpoint.

In [None]:
from time import sleep

status = "initializing"

# Poll every two minutes until the job leaves the initializing/running states.
while status == "running" or status == "initializing":
    sleep(120)
    resp = requests.get(f"{EVAL_URL}/v1/evaluation/jobs/{retriever_job_id}")
    status = resp.json()["status"]["status"]
pp(resp.json())
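The loop above runs until the job leaves the `initializing` and `running` states, however long that takes. A variant with an upper bound on the wait is sketched below. `wait_for_job` and its parameters are hypothetical names introduced for illustration, and the status lookup is passed in as a function so the example can run offline with a scripted sequence of statuses.

```python
import time
from typing import Callable


def wait_for_job(
    fetch_status: Callable[[], str],
    poll_seconds: float = 120.0,
    timeout_seconds: float = 4 * 60 * 60,
    sleep_fn: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job leaves the in-progress states or the timeout expires."""
    in_progress = {"initializing", "running"}
    waited = 0.0
    status = fetch_status()
    while status in in_progress:
        if waited >= timeout_seconds:
            raise TimeoutError(f"job still '{status}' after {waited:.0f}s")
        sleep_fn(poll_seconds)
        waited += poll_seconds
        status = fetch_status()
    return status


# Offline demonstration: scripted statuses stand in for HTTP calls.
scripted = iter(["initializing", "running", "running", "succeeded"])
final_status = wait_for_job(lambda: next(scripted), poll_seconds=0, sleep_fn=lambda s: None)
print(final_status)
```

In this notebook, `fetch_status` could be `lambda: requests.get(f"{EVAL_URL}/v1/evaluation/jobs/{retriever_job_id}").json()["status"]["status"]`, which is the same lookup the loop above performs.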

Once the job has finished, we can download the results archive in the cell below.

> NOTE: When the evaluation `status` becomes `succeeded`, the `evaluation_results` field will become populated.

In [195]:
import requests

# The URL we're sending the GET request to
url = f"{EVAL_URL}/v1/evaluation/jobs/-/{retriever_job_id}/download-results"
filename = f"retriever_{retriever_job_id}.zip"
# Additional headers being sent with the request
headers = {
    'accept': 'application/json',
}

# verify=False skips TLS certificate verification (the equivalent of curl's -k flag).
# WARNING: This is insecure and should only be used with caution.
response = requests.get(url, headers=headers, verify=False)

# Check if the request was successful
if response.status_code == 200:
    # Write the content of the response to a file
    with open(filename, 'wb') as file:
        file.write(response.content)
    print("Downloaded the file successfully.")
else:
    print(f"Failed to download the file. Status code: {response.status_code}")

Downloaded the file successfully.
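The download is a zip archive, so it can be inspected with Python's standard `zipfile` module before extracting anything. The sketch below lists the members of an archive held in memory. Since the layout of the Evaluator's results archive is not described in this notebook, the example builds a small stand-in archive with a made-up file name and contents.

```python
import io
import zipfile


def list_archive(zip_bytes: bytes) -> list:
    """Return (name, size in bytes) for every member of a zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return [(info.filename, info.file_size) for info in archive.infolist()]


# Build a tiny stand-in archive in memory; the file name and contents are made up.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("results/example.json", '{"recall_5": 0.5}')

for name, size in list_archive(buffer.getvalue()):
    print(f"{name}: {size} bytes")
```

For the file saved above, the bytes can be read with `open(filename, "rb").read()` and passed to `list_archive` in the same way.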


## Retriever + Reranking Evaluation on FiQA

For our second evaluation, we're going to evaluate our Retrieval Model (`nvidia/llama-3.2-nv-embedqa-1b-v2`) on the [FiQA](https://sites.google.com/view/fiqa/) retrieval task as part of the [BeIR](https://github.com/beir-cellar/beir) benchmark.

Instead of simply using a Retriever model, however, this example will also leverage a Reranking model (`nvidia/llama-3.2-nv-rerankqa-1b-v2`) to rerank the retrieved results.

We'll rerun the same evaluation configuration as we did above - with a few extra parameters in our `retriever` configuration:

- `url`, under `reranker_model.api_endpoint`, which will point to our reranking model
- `model_id`, under `reranker_model.api_endpoint`, which will contain the name of our reranking model

In [196]:
import requests
import json

headers = {
    'Authorization': f'Bearer {NGC_API_KEY}',
    'Accept': 'application/json',
    'Content-Type': 'application/json'
}

# Data payload as a Python dictionary
data = {
    "model": RERANKER_MODEL_NAME,
    "query": {"text": "which way did the traveler go?"},
    "passages": [
        {"text": "two roads diverged in a yellow wood, and sorry i could not travel both and be one traveler, long i stood and looked down one as far as i could to where it bent in the undergrowth;"},
        {"text": "then took the other, as just as fair, and having perhaps the better claim because it was grassy and wanted wear, though as for that the passing there had worn them really about the same,"},
        {"text": "and both that morning equally lay in leaves no step had trodden black. oh, i marked the first for another day! yet knowing how way leads on to way i doubted if i should ever come back."},
        {"text": "i shall be telling this with a sigh somewhere ages and ages hense: two roads diverged in a wood, and i, i took the one less traveled by, and that has made all the difference."}
    ],
    "truncate": "END"
}

# Making the POST request
response = requests.post(RERANKER_URL, headers=headers, json=data)

# Printing the response
print(response.text)

{"rankings":[{"index":0,"logit":1.564453125},{"index":3,"logit":-0.497802734375},{"index":2,"logit":-3.697265625},{"index":1,"logit":-6.2578125}],"usage":{"prompt_tokens":220,"total_tokens":220}}
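Each entry in `rankings` pairs the `index` of a passage in the request with a `logit`, and in this response the entries arrive ordered from the highest logit to the lowest. The sketch below maps those indices back to passages. It pastes the printed response in as a string so it runs offline, uses shortened labels in place of the full passages, and `order_passages` is a hypothetical helper name. Reading the sigmoid of a logit as a probability is an assumption made here for readability, not something the response states.

```python
import json
import math

# The response printed above, pasted in as a string so the example runs offline.
raw = (
    '{"rankings":[{"index":0,"logit":1.564453125},{"index":3,"logit":-0.497802734375},'
    '{"index":2,"logit":-3.697265625},{"index":1,"logit":-6.2578125}],'
    '"usage":{"prompt_tokens":220,"total_tokens":220}}'
)

# Short labels standing in for the four passages sent in the request.
passages = [
    "two roads diverged...",
    "then took the other...",
    "and both that morning...",
    "i shall be telling this...",
]


def order_passages(response: dict, texts: list) -> list:
    """Return (passage, logit) pairs sorted from highest to lowest logit."""
    ranked = sorted(response["rankings"], key=lambda r: r["logit"], reverse=True)
    return [(texts[r["index"]], r["logit"]) for r in ranked]


for text, logit in order_passages(json.loads(raw), passages):
    # The sigmoid squashes the logit into (0, 1); see the assumption noted above.
    print(f"{logit:8.3f}  {1 / (1 + math.exp(-logit)):.3f}  {text}")
```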


In [197]:
reranker_target_config = {
 "type": "retriever",
 "retriever": {
   "pipeline": {
     "query_embedding_model": {
       "api_endpoint": {
         "url": EMBEDDING_URL,
         "model_id": EMBEDDING_MODEL_NAME,
         "api_key": NGC_API_KEY
       }
     },
     "index_embedding_model": {
       "api_endpoint": {
         "url": EMBEDDING_URL,
         "model_id": EMBEDDING_MODEL_NAME,
         "api_key": NGC_API_KEY
       }
     },
     "reranker_model": {
       "api_endpoint": {
         "url": RERANKER_URL,
         "model_id": RERANKER_MODEL_NAME,
         "api_key": NGC_API_KEY
       }
     },
     "top_k": 10
   }
 }
}

Then we are clear to fire off the request!

In [198]:
reranker_response = requests.post(
    target_endpoint,
    json=reranker_target_config,
    headers={'accept': 'application/json'},
    verify=False)

reranker_target_name = reranker_response.json()["name"]
print(reranker_target_name)

eval-target-MWrthRkE2tkJ8jC1WgsZ7x


We'll capture our target ID for the coming steps - but with this step we have created our target and are ready to create an evaluation configuration!

In [199]:
reranker_target_namespace = reranker_response.json()["namespace"]
print(f"Target Name: {reranker_target_name}, Target Namespace: {reranker_target_namespace}")

Target Name: eval-target-MWrthRkE2tkJ8jC1WgsZ7x, Target Namespace: -


> NOTE: Notice how we don't have to re-create our evaluation configuration, since we already created it for the Embedding model evaluation! We'll reuse `retriever_config_name` in the job payload below.

### Running an Evaluation Job

Now that we have our `target_id` and `config_id` -  we have everything we need to run an evaluation.

Let's see the process to create and run a job! 

First things first, we need to create a job payload to send to our endpoint - this will point to our target, and our configuration.

In [200]:
reranker_job_config = {
    "target": f"default/{reranker_target_name}",
    "config": f"default/{retriever_config_name}",
    "tags": [
        "embedding-rerank-fiqa"
    ]
}

All that's left to do is fire off our job!

In [201]:
reranker_job_response = requests.post(
    job_endpoint,
    json=reranker_job_config,
    headers={'accept': 'application/json'},
    verify=False)

reranker_job_id = reranker_job_response.json()["id"]
print(f"Job ID: {reranker_job_id}")

Job ID: eval-Dw5SrTXA3wQsuwiMgiw6gX


#### Monitoring

We can monitor the status of our job through the following endpoint.

In [203]:
from time import sleep

status = "initializing"

# Poll every two minutes until the job leaves the initializing/running states.
while status == "running" or status == "initializing":
    sleep(120)
    resp = requests.get(f"{EVAL_URL}/v1/evaluation/jobs/{reranker_job_id}")
    status = resp.json()["status"]["status"]
pp(resp.json())

{'namespace': '-',
 'name': 'eval-Dw5SrTXA3wQsuwiMgiw6gX',
 'tags': ['embedding-rerank-fiqa'],
 'id': 'eval-Dw5SrTXA3wQsuwiMgiw6gX',
 'target': {'namespace': '-',
            'name': 'eval-target-MWrthRkE2tkJ8jC1WgsZ7x',
            'type': 'retriever',
            'model': None,
            'retriever': {'pipeline': {'query_embedding_model': {'api_endpoint': {'url': 'https://integrate.api.nvidia.com/v1/embeddings',
                                                                                  'model_id': 'nvidia/llama-3.2-nv-embedqa-1b-v2',
                                                                                  'api_key': 'REDACTED'},
                                                                 'cached_outputs': None},
                                       'index_embedding_model': {'api_endpoint': {'url': 'https://integrate.api.nvidia.com/v1/embeddings',
                                                    

We can check on the status of our evaluation in the cell below. 

> NOTE: When the evaluation `status` becomes `succeeded`, the `evaluation_results` field will become populated.

Once it's done - let's look at the full results!

In [204]:
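# Keep the final response from the monitoring loop under a descriptive name.
reranker_monitoring_response = resp.json()
print(reranker_monitoring_response)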

{'namespace': '-', 'name': 'eval-Dw5SrTXA3wQsuwiMgiw6gX', 'tags': ['embedding-rerank-fiqa'], 'id': 'eval-Dw5SrTXA3wQsuwiMgiw6gX', 'target': {'namespace': '-', 'name': 'eval-target-MWrthRkE2tkJ8jC1WgsZ7x', 'type': 'retriever', 'model': None, 'retriever': {'pipeline': {'query_embedding_model': {'api_endpoint': {'url': 'https://integrate.api.nvidia.com/v1/embeddings', 'model_id': 'nvidia/llama-3.2-nv-embedqa-1b-v2', 'api_key': 'REDACTED'}, 'cached_outputs': None}, 'index_embedding_model': {'api_endpoint': {'url': 'https://integrate.api.nvidia.com/v1/embeddings', 'model_id': 'nvidia/llama-3.2-nv-embedqa-1b-v2', 'api_key': 'REDACTED'}, 'cached_outputs': None}, 'reranker_model': {'api_endpoint': {'url': 'https://ai.api.nvidia.com/v1/retrieval/nvidia/llama-3_2-nv-rerankqa-1b-v2/reranking', 'model_id': 'nvidia/llama-3.2-nv-rerankqa-1b-v2', 'api_key': 'REDACTED'

The `evaluation_results` field will contain our `metrics` along with their name, and their score.

In [205]:
import requests

# The URL we're sending the GET request to
url = f"{EVAL_URL}/v1/evaluation/jobs/-/{reranker_job_id}/download-results"
filename = f"reranker_{reranker_job_id}.zip"
# Additional headers being sent with the request
headers = {
    'accept': 'application/json',
}

# verify=False skips TLS certificate verification (the equivalent of curl's -k flag).
# WARNING: This is insecure and should only be used with caution.
response = requests.get(url, headers=headers, verify=False)

# Check if the request was successful
if response.status_code == 200:
    # Write the content of the response to a file
    with open(filename, 'wb') as file:
        file.write(response.content)
    print("Downloaded the file successfully.")
else:
    print(f"Failed to download the file. Status code: {response.status_code}")

Downloaded the file successfully.


## HW 1 - Retrieval Augmented Generation (RAG) Evaluation on FiQA with Ragas Metrics

With the most recent release of NeMo Evaluator microservice, not only can we evaluate Retrievers and Rerankers - we can also Evaluate RAG!

Once again, we're going to evaluate on the [FiQA](https://sites.google.com/view/fiqa/) retrieval task as part of the [BeIR](https://github.com/beir-cellar/beir) benchmark.

We're also going to evaluate our RAG pipeline on the [Ragas](https://docs.ragas.io/en/stable/howtos/index.html) metric ["Faithfulness"](https://docs.ragas.io/en/stable/concepts/metrics/faithfulness.html). This can be done by extending our evaluation configuration in the following ways:

1. We can create a target of type `rag`, and provide the `retriever` configuration we used in the first evaluation.
2. We need to provide a `context_ordering` parameter, in this case we'll use `desc` which will order our context in descending score.
3. We need to provide a "generator" (LLM) that can be used to generate responses based on the retrieved context!

We'll also need to add in a number of `judge_` parameters to help calculate the Faithfulness metric.
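For intuition, the Ragas documentation describes Faithfulness as the share of claims in the generated answer that are supported by the retrieved context, with an LLM judge extracting the claims and checking each one. The sketch below replaces the judge with hand-written verdicts on made-up claims, so it shows only the final ratio and is not the Ragas or Evaluator implementation. Returning 0.0 for an answer with no claims is a choice made here, not documented behavior.

```python
def faithfulness_score(verdicts):
    """Share of answer claims judged to be supported by the retrieved context."""
    if not verdicts:
        # Assumption for this sketch: an answer with no claims scores 0.0.
        return 0.0
    return sum(1 for supported in verdicts if supported) / len(verdicts)


# Hypothetical claims extracted from a generated answer, each with a hand-written verdict.
claims = [
    ("Index funds track a market index.", True),
    ("Index funds usually charge low fees.", True),
    ("Index funds guarantee a profit.", False),
]

score = faithfulness_score([supported for _, supported in claims])
print(f"faithfulness = {score:.2f}")
```

Two of the three claims are supported, so the score is 2/3. In the real pipeline, the judge model configured through the `judge_` parameters produces these verdicts.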

Before building the target, let's take the LLM for a quick test drive, then look at an example target configuration:

In [None]:
import requests
import json

headers = {
    'Authorization': f'Bearer {NGC_API_KEY}',
    'Accept': 'application/json',
    'Content-Type': 'application/json'
}

# Data payload as a Python dictionary.
# LLM_URL points at the /v1/completions endpoint, which takes a plain "prompt"
# rather than a chat-style "messages" list.
data = {
    "prompt": "Write a limerick about the wonders of GPU computing.",
    "model": LLM_MODEL_NAME,
    "top_p": 0.7,
    "max_tokens": 1024,
    "seed": 42,
    "stream": False,
    "presence_penalty": 0,
    "frequency_penalty": 0,
    "temperature": 0.2
}

# Making the POST request
response = requests.post(LLM_URL, headers=headers, json=data)

# Printing the response
print(response.text)

In [207]:
rag_target_config = {
 "type": "rag",
 "rag": {
   "pipeline": {
     "retriever": {
       "pipeline": {
         "query_embedding_model": {
           "api_endpoint": {
               "url": EMBEDDING_URL,
               "model_id": EMBEDDING_MODEL_NAME,
               "api_key": NGC_API_KEY
               
           }
         },
         "index_embedding_model": {
           "api_endpoint": {
               "url": EMBEDDING_URL,
               "model_id": EMBEDDING_MODEL_NAME,
               "api_key": NGC_API_KEY
           }
         }
       }
     },
     "model": {
       "api_endpoint": {
           "url": LLM_URL,
           "model_id": LLM_MODEL_NAME,
           "api_key": NGC_API_KEY
       }
     }
   }
 }
}

We'll want to point our request at the `v1/evaluation/targets` endpoint to create the target.

Then we are clear to fire off the request!

In [208]:
rag_response = requests.post(
    target_endpoint,
    json=rag_target_config,
    headers={'accept': 'application/json'},
    verify=False)

rag_target_name = rag_response.json()["name"]
print(rag_target_name)

eval-target-3HUyYMYJ4F1Taskr6sbfdc


We'll capture our target ID for the coming steps - but with this step we have created our target and are ready to create an evaluation configuration!

In [209]:
rag_target_namespace = rag_response.json()["namespace"]
print(f"Target Name: {rag_target_name}, Target Namespace: {rag_target_namespace}")

Target Name: eval-target-3HUyYMYJ4F1Taskr6sbfdc, Target Namespace: -


Now we can grab our evaluation configuration.

In [210]:
rag_eval_config = {
 "type": "rag",
 "tasks": [
   {
     "type": "beir",
     "params": {
       "judge_llm": {
         "api_endpoint": {
           "url": LLM_URL,
           "model_id": LLM_MODEL_NAME,
           "api_key": NGC_API_KEY
         }
       },
       "judge_embeddings": {
         "api_endpoint": {
           "url": EMBEDDING_URL,
           "model_id": EMBEDDING_MODEL_NAME,
           "api_key": NGC_API_KEY
         }
       },
       "judge_timeout": 300,
       "judge_max_retries": 5,
       "judge_max_workers": 16
     },
     "dataset": {
       "files_url": "fiqa",
       "format": "beir"
     },
     "metrics": [
       {
         "name": "recall_5"
       },
       {
         "name": "ndcg_cut_5"
       },
       {
         "name": "recall_10"
       },
       {
         "name": "ndcg_cut_10"
       },
       {
         "name": "faithfulness"
       }
     ]
   }
 ]
}


Now that we have our payload, we can send it to our NeMo Evaluator endpoint, reusing the `eval_config_endpoint` we defined earlier.

In [211]:
rag_eval_response = requests.post(
    eval_config_endpoint,
    json=rag_eval_config,
    headers={'accept': 'application/json'},
    verify=False)

rag_config_name = rag_eval_response.json()["name"]
print(rag_config_name)

eval-config-DPAvbz89MTTqjX4G8yQ9P6


Let's again capture our evaluation config for use later.

In [212]:
rag_config_namespace = rag_eval_response.json()["namespace"]
print(f"Config Name: {rag_config_name}, Config Namespace: {rag_config_namespace}")

Config Name: eval-config-DPAvbz89MTTqjX4G8yQ9P6, Config Namespace: -


### Running an Evaluation Job

Now that we have our `target_id` and `config_id` -  we have everything we need to run an evaluation.

Let's see the process to create and run a job! 

First things first, we need to create a job payload to send to our endpoint - this will point to our target, and our configuration.

In [213]:
rag_job_config = {
    "target": f"default/{rag_target_name}",
    "config": f"default/{rag_config_name}",
    "tags": [
        "rag-eval"
    ]
}

All that's left to do is fire off our job!

In [214]:
rag_job_response = requests.post(
    job_endpoint,
    json=rag_job_config,
    headers={'accept': 'application/json'},
    verify=False)

rag_job_id = rag_job_response.json()["id"]
print(f"Job ID: {rag_job_id}")

Job ID: eval-75pK5iNYWdPNnyaer2yVGB


#### Monitoring

We can monitor the status of our job through the following endpoint.

In [216]:
from time import sleep

status = "initializing"

# Poll every two minutes until the job leaves the initializing/running states.
while status == "running" or status == "initializing":
    sleep(120)
    resp = requests.get(f"{EVAL_URL}/v1/evaluation/jobs/{rag_job_id}")
    status = resp.json()["status"]["status"]
pp(resp.json())

{'namespace': '-',
 'name': 'eval-75pK5iNYWdPNnyaer2yVGB',
 'tags': ['rag-eval'],
 'id': 'eval-75pK5iNYWdPNnyaer2yVGB',
 'target': {'namespace': '-',
            'name': 'eval-target-3HUyYMYJ4F1Taskr6sbfdc',
            'type': 'rag',
            'model': None,
            'retriever': None,
            'rag': {'pipeline': {'retriever': {'pipeline': {'query_embedding_model': {'api_endpoint': {'url': 'https://integrate.api.nvidia.com/v1/embeddings',
                                                                                                       'model_id': 'nvidia/llama-3.2-nv-embedqa-1b-v2',
                                                                                                       'api_key': 'REDACTED'},
                                                                                      'cached_outputs': None},
                                                            'index_embedding_model': {'api_endp

In [217]:
import requests

# The URL we're sending the GET request to
url = f"{EVAL_URL}/v1/evaluation/jobs/-/{rag_job_id}/download-results"
filename = f"rag_{rag_job_id}.zip"
# Additional headers being sent with the request
headers = {
    'accept': 'application/json',
}

# verify=False skips TLS certificate verification (the equivalent of curl's -k flag).
# WARNING: This is insecure and should only be used with caution.
response = requests.get(url, headers=headers, verify=False)

# Check if the request was successful
if response.status_code == 200:
    # Write the content of the response to a file
    with open(filename, 'wb') as file:
        file.write(response.content)
    print("Downloaded the file successfully.")
else:
    print(f"Failed to download the file. Status code: {response.status_code}")

Downloaded the file successfully.


The above evaluations provide you with the initial tools to understand the quality of your RAG pipeline. You can try modifying the models to see how that affects the quality of the results. You can also build a custom dataset and use it in place of the default datasets used in the evaluation tasks above.
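As a starting point for a custom dataset, the sketch below writes a tiny corpus in the file layout used by the BeIR benchmark datasets: `corpus.jsonl`, `queries.jsonl`, and a tab-separated `qrels/test.tsv`. The helper name `write_beir_dataset` and the sample records are made up for illustration. How to hand such a directory to the Evaluator (for example, what to put in `files_url`) is not covered in this notebook, so check the Evaluator documentation for that step.

```python
import csv
import json
import tempfile
from pathlib import Path


def write_beir_dataset(root: Path, corpus: dict, queries: dict, qrels: dict) -> None:
    """Write corpus.jsonl, queries.jsonl and qrels/test.tsv under root."""
    root.mkdir(parents=True, exist_ok=True)
    with open(root / "corpus.jsonl", "w", encoding="utf-8") as f:
        for doc_id, doc in corpus.items():
            record = {"_id": doc_id, "title": doc.get("title", ""), "text": doc["text"]}
            f.write(json.dumps(record) + "\n")
    with open(root / "queries.jsonl", "w", encoding="utf-8") as f:
        for query_id, text in queries.items():
            f.write(json.dumps({"_id": query_id, "text": text}) + "\n")
    (root / "qrels").mkdir(exist_ok=True)
    with open(root / "qrels" / "test.tsv", "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerow(["query-id", "corpus-id", "score"])
        for query_id, judged in qrels.items():
            for doc_id, score in judged.items():
                writer.writerow([query_id, doc_id, score])


# Made-up finance-flavored records, purely for illustration.
dataset_dir = Path(tempfile.mkdtemp()) / "my_beir_dataset"
write_beir_dataset(
    dataset_dir,
    corpus={"doc1": {"title": "Index funds", "text": "An index fund tracks a market index."}},
    queries={"q1": "What is an index fund?"},
    qrels={"q1": {"doc1": 1}},
)
print(sorted(p.relative_to(dataset_dir).as_posix() for p in dataset_dir.rglob("*") if p.is_file()))
```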