# Playground & e2e Evaluation for Redbox RAG chat  <a class="anchor" id="title"></a>

## Table of Contents <a class="anchor" id="toc"></a>
* [Overview](#one-section)
* [Metrics](#two-section)
    - [Faithfulness]()
    - [Answer Relevancy]()
    - [Hallucination]()
* [Evaluation Dataset](#three-section)
* [Evaluation Workflow in this Notebook](#four-section)
* [Prompt Playground](#five-section)
    - [RAG prompts](#six-section)
* [Generate RAG responses and append them to evaluation dataset](#seven-section)

* [eight](#eight-section)


## Overview <a class="anchor" id="one-section"></a>

When optimising the generation part of our RAG system, the only thing we can modify directly is the `RAG prompts` that are passed to the LLM along with the context. Other components also affect the overall generation evaluation score, such as the quality of the retrieved context, but the levers for changing them sit further upstream in the RAG pipeline and are evaluated in the Retrieval Evaluation and e2e Evaluation notebooks. Those components are also slower to change than prompts, which are just natural language!

We want to avoid using the /chat/rag endpoint for quick experimentation with `RAG prompts`, because rebuilding the `core_api` docker image and stopping and starting the container slows development considerably. Changing a prompt is very quick, so we want equally quick evaluation of how each prompt change affects the metrics.

For this reason, the /chat/rag endpoint function is reproduced in this notebook, so prompts can be changed in a single place and feedback is much quicker. If your prompt experiments look good, i.e. they improve the generation evaluation metrics, consider making the same changes in the `core_api` service. Information on where to make the corresponding changes in the `core_api` service is at the bottom of this notebook. Once you make changes in `core_api` and rebuild, they will be reflected in the deployed /chat/rag endpoint.

We will evaluate RAG generation using metrics described in the next section.

[Back to top](#title)

---------------

## Metrics <a class="anchor" id="two-section"></a>

#TODO: Add retrieval metrics too

Start by using 3 DeepEval metrics:
- Faithfulness
- Answer Relevancy **(what are we taking as 'input'? Raw question or refined question?)**
- Hallucination

### Faithfulness

The faithfulness metric measures the quality of your RAG pipeline's generator by evaluating whether the `actual_output` factually aligns with the contents of your `retrieval_context`. `deepeval`'s faithfulness metric is a self-explaining LLM-Eval, meaning it outputs a reason for its metric score.

##### Required Arguments
To use the `FaithfulnessMetric`, you need to provide the following arguments when creating an LLMTestCase:

- `input`
- `actual_output`
- `retrieval_context`

[Back to top](#title)

### Answer Relevancy
The answer relevancy metric measures the quality of your RAG pipeline's generator by evaluating how relevant the `actual_output` of your LLM application is compared to the provided `input`. `deepeval`'s answer relevancy metric is a self-explaining LLM-Eval, meaning it outputs a reason for its metric score.

##### Required Arguments
To use the AnswerRelevancyMetric, you'll have to provide the following arguments when creating an LLMTestCase:

- `input`
- `actual_output`

[Back to top](#title)

### Hallucination
The hallucination metric determines whether your LLM generates factually correct information by comparing the `actual_output` to the provided `context`.

##### Required Arguments
To use the HallucinationMetric, you'll have to provide the following arguments when creating an LLMTestCase:

- `input`
- `actual_output`
- `context`
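
The required arguments for the three metrics can be summarised in a small lookup. The sketch below is a hypothetical helper (the names `REQUIRED_FIELDS` and `missing_fields` are illustrative and are not part of `deepeval`). It checks a plain dictionary for the fields each metric description above relies on, before the dictionary is turned into an `LLMTestCase`.

```python
# Hypothetical helper: the field names mirror the LLMTestCase arguments
# described in the metric sections; the helper itself is not part of deepeval.
REQUIRED_FIELDS = {
    "faithfulness": ("input", "actual_output", "retrieval_context"),
    "answer_relevancy": ("input", "actual_output"),
    "hallucination": ("input", "actual_output", "context"),
}


def missing_fields(metric: str, test_case: dict) -> list[str]:
    """Return the required fields that are absent or empty for `metric`."""
    return [field for field in REQUIRED_FIELDS[metric] if not test_case.get(field)]


# Illustrative test case content, not taken from the evaluation dataset.
example = {
    "input": "What is Universal Basic Income?",
    "actual_output": "A regular, unconditional payment to all citizens.",
    "retrieval_context": ["UBI is a regular payment made to everyone."],
}

print(missing_fields("faithfulness", example))   # []
print(missing_fields("hallucination", example))  # ['context']
```

A check like this could run before evaluation so that a missing `context` or `retrieval_context` is caught early rather than partway through a metric run.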

[Back to top](#title)

-----------------

## Notebook Setup <a class="anchor" id="five-section"></a>

Basic setup needed for the RAG chat function to work during experimentation.

In [1]:
# Enable autoreload
%reload_ext autoreload
%autoreload 2

# Check this autoreload works in vscode

In [2]:
import os
import logging
from langchain.prompts.prompt import PromptTemplate

**Mock HTTP Authorization Credentials, used by the RAG chat function**

In [4]:
from core_api.src.auth import get_user_uuid
from fastapi.security import HTTPAuthorizationCredentials
from unittest.mock import Mock

In [None]:
# Mock HTTP Authorization Credentials
credentials = Mock(spec=HTTPAuthorizationCredentials)
credentials.credentials = "mock_token"

**RAG chat function imports**

In [6]:
from langchain.chains.llm import LLMChain
from langchain.chains.qa_with_sources import load_qa_with_sources_chain
from langchain_community.chat_models import ChatLiteLLM
from langchain_core.prompts import ChatPromptTemplate

[Back to top](#title)

---------------

## Evaluation Dataset <a class="anchor" id="three-section"></a>

### Load dataset

In [None]:
# #TODO: Load evaluation dataset for generation evaluation
# from deepeval.dataset import EvaluationDataset

# dataset = EvaluationDataset()
# dataset.add_test_cases_from_json_file(
#     # file_path is the absolute path to you .json file
#     file_path="example.json",
#     input_key_name="query",
#     actual_output_key_name="actual_output",
#     expected_output_key_name="expected_output",
#     context_key_name="context",
#     retrieval_context_key_name="retrieval_context",
# )

[Back to top](#title)

------

## Evaluation Workflow in this Notebook <a class="anchor" id="four-section"></a>

Follow these steps to run an experiment:
1. Make experimental changes to [`RAG prompts`]() - these will be used by the /chat/rag function
2. (Optional) Make experimental changes to the [/chat/rag function]() 
3. Pass the evaluation dataset through the /chat/rag function to generate `actual_output` and append these to the evaluation dataset
4. Run evaluations on the dataset to calculate generation evaluation metrics
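
Step 3 of the workflow above can be sketched roughly as follows. This is a minimal illustration, not the notebook's implementation: `append_actual_outputs` and `fake_rag` are hypothetical names, and `fake_rag` stands in for the /chat/rag function so the sketch runs without an LLM.

```python
from typing import Callable


def append_actual_outputs(records: list[dict], generate: Callable[[str], str]) -> list[dict]:
    """Call `generate` on each record's input and store the answer as `actual_output`."""
    return [{**record, "actual_output": generate(record["input"])} for record in records]


# Stand-in for the /chat/rag function (assumption: it maps a question to answer text).
def fake_rag(question: str) -> str:
    return f"Answer to: {question}"


records = [{"input": "What is Universal Basic Income?"}]
records = append_actual_outputs(records, fake_rag)
print(records[0]["actual_output"])  # Answer to: What is Universal Basic Income?
```

Passing the generator in as a callable keeps the dataset handling separate from the prompt experiment, so the same loop could be reused across experiments.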

[Back to top](#title)

-------------

## Document Processing

#### Documents for Evaluation

We have a list of docs to use for evaluation. This list should continue to grow so that it covers the different types of documents and content that users of Redbox may require, giving us good evaluation coverage.

#### Run MinIO container
The file chunker is set up to get files from an s3 bucket. So in order to use the existing `core_api` parsing code as is, we must go through the process of running a local MinIO container and uploading files to it.

Follow these steps:
1. Ensure you have a docker runtime running
2. In the evaluation folder, create the `minio/data` directories
3. Get the ABSOLUTE path of this `minio/data` directory on your system and use it to replace the path in the penultimate (`-v`) line below
4. Run a MinIO container by running the cell below in a terminal window (from https://min.io/docs/minio/container/index.html)
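
The directory creation and absolute-path lookup in the steps above can also be done from Python. This is a small sketch that assumes it is run from the evaluation folder; the variable names are illustrative.

```python
from pathlib import Path

# Assumption: the current working directory is the evaluation folder.
data_dir = Path("minio") / "data"
data_dir.mkdir(parents=True, exist_ok=True)  # creates minio/ and minio/data if missing

# Absolute path to paste into the `-v <absolute path>:/data` line of the docker command
absolute_data_dir = data_dir.resolve()
print(f"-v {absolute_data_dir}:/data")
```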

In [None]:
"""
docker run \
   -p 9000:9000 \
   -p 9001:9001 \
   --user $(id -u):$(id -g) \
   --name minio \
   -e "MINIO_ROOT_USER=minioadmin" \
   -e "MINIO_ROOT_PASSWORD=minioadmin" \
   -v /Users/andy/tw/gen-ai/redbox-copilot/notebook/evaluation/minio/data:/data \
   quay.io/minio/minio server /data --console-address ":9001"
"""

Optionally, open `localhost:9000` in a browser and log in with the username and password set above.

#### Run Elasticsearch container

In [None]:
!docker pull docker.elastic.co/elasticsearch/elasticsearch:8.11.0

Run an elasticsearch container by running the cell below in a terminal window

In [None]:
"""
docker run -p 9200:9200 -p 9300:9300 --name elasticsearch \
-e "discovery.type=single-node" \
docker.elastic.co/elasticsearch/elasticsearch:8.11.0

"""

In [None]:
#TODO: Can we mount a volume to persist loaded documents, as we build up the test set?

#### Document Upload

**Import Pydantic chat models, used by the RAG chat function**

In [9]:
from minio import Minio
from minio.error import S3Error

In [11]:
# Used by the file functions
from redbox.models import Chunk, File, FileStatus, Settings
from redbox.storage import ElasticsearchStorageHandler
from redbox.model_db import SentenceTransformerDB
from redbox.parsing.file_chunker import FileChunker
# Used by the RAG chat function
from redbox.models.chat import ChatMessage, ChatRequest, ChatResponse, SourceDocument

In [12]:
# === Logging ===

logging.basicConfig(level=logging.INFO)
log = logging.getLogger()

env = Settings()

In [13]:
# === Storage ===
# Initialize an ElasticsearchStorageHandler
# es = env.elasticsearch_client()
from elasticsearch import Elasticsearch
es = Elasticsearch(
                hosts=[
                    {
                        "host": "localhost",
                        "port": 9200,
                        "scheme": "http",
                    }
                ],
                basic_auth=("elastic", "redboxpass"),
            )
storage_handler = ElasticsearchStorageHandler(es_client=es, root_index="redbox-data")

# Initialize a FileChunker
chunker = FileChunker()

**Change needs to be made in chunker.py too**

In [14]:
# === Object Store ===
import boto3
def s3_client():
    client = boto3.client(
        "s3",
        aws_access_key_id="",
        aws_secret_access_key="",
        endpoint_url=f"http://localhost:9000",
    )
    return client

s3 = s3_client()

In [None]:
# from minio import Minio
# from minio.error import S3Error

# def get_from_minio(file_path, bucket_name):
#     # Create a client with the MinIO server, its access key and secret key.
#     client = Minio("localhost:9000", access_key="minioadmin", secret_key="minioadmin", secure=False)

#     try:
#         # Check if 'bucket_name' exists.
#         if client.bucket_exists(bucket_name):
#             # Get 'file_path' from bucket 'bucket_name'.
#             data = client.get_object(bucket_name, file_path)
#             # Read the first kilobyte of data and print it
#             print(data.read(1024))
#         else:
#             print(f'Bucket {bucket_name} does not exist')

#     except S3Error as exc:
#         print(f'error occurred: {exc}')

In [15]:
def upload_to_minio(file_path, bucket_name):
    # Create a client with the MinIO server, its access key and secret key.
    client = Minio("localhost:9000", access_key="minioadmin", secret_key="minioadmin", secure=False)

    try:
        # Make 'bucket_name' if not exist.
        if not client.bucket_exists(bucket_name):
            client.make_bucket(bucket_name)

        # Upload 'file_path' as object name in bucket 'bucket_name'.
        client.fput_object(bucket_name, file_path, file_path)
        print(f'Successfully uploaded {file_path} to {bucket_name}')

    except S3Error as exc:
        print(f'error occurred: {exc}')


In [17]:
# This loads to redbox-storage-dev/data_eval/Universal-Basic-Income-Scotland-Report.pdf
# Is the 'data_eval' part of the path going to be a problem?
upload_to_minio("Universal-Basic-Income-Scotland-Report.pdf", "redbox-storage-dev")



MaxRetryError: HTTPConnectionPool(host='localhost', port=9000): Max retries exceeded with url: /redbox-storage-dev?location= (Caused by ProtocolError('Connection aborted.', BadStatusLine('ÿ\x00\x00\x00\x00\x00\x00\x00\x01\x7fo\x01t: localhost:9000\r\n')))

Take a note of this file name. If it differs from `redbox-storage-dev/data_eval/Universal-Basic-Income-Scotland-Report.pdf`, you will need to update the file name passed to the `ingest` function in the chunking playground section below.
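
The `File` created in the chunking section takes the bucket and the key separately, so the full path noted here needs to be split at the first `/`. A minimal sketch follows; `split_object_path` is a hypothetical helper, not part of Redbox or MinIO.

```python
def split_object_path(object_path: str) -> tuple[str, str]:
    """Split '<bucket>/<key>' into (bucket, key); the key may itself contain slashes."""
    bucket, _, key = object_path.partition("/")
    if not bucket or not key:
        raise ValueError(f"Expected '<bucket>/<key>', got {object_path!r}")
    return bucket, key


bucket, key = split_object_path(
    "redbox-storage-dev/data_eval/Universal-Basic-Income-Scotland-Report.pdf"
)
print(bucket)  # redbox-storage-dev
print(key)     # data_eval/Universal-Basic-Income-Scotland-Report.pdf
```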

In [None]:
# file = "redbox-storage-dev/data_eval/Universal-Basic-Income-Scotland-Report.pdf"

[Back to top](#title)

---------

## Chunking Playground

Please experiment with different chunking strategies to see if the retrieval evaluation metrics can be improved. **For MVP focus on `other_chunker`**

#### Document Chunking

Redbox currently has two types of chunking methods:
- `chunk_clustering`  | located in: redbox/parsing/chunk_clustering.py
- `other_chunker`     | located in: redbox/parsing/chunkers.py

#### [2024-05-14] Chunk Clustering steer for MVP
Chunk clustering stage (maybe we should call it agglomeration for accuracy?) currently takes a significant amount of processing in the upload phase of the user journey. Given we have limited capacity for evaluating the benefit it is bringing whilst we bring our benchmarks online we will disable the chunk clustering phase in the short term. 

This is with the view of reinstating further down the line and potentially exploring enhancements post MVP. These enhancements could be the recursive clustering idea or factoring document structure into the current algorithm.

#### Steps to follow
1. Create a git branch off of `main`
2. Make edits directly in the [Redbox parsing file](../../redbox/parsing/chunkers.py), within the `other_chunker` function
3. Follow the imports below and then continue with the rest of the notebook to get retrieval evaluation scores

**NOTE** If you continue to make changes to the `other_chunker` function, ensure you have run the `%autoreload` cell so that updates are automatically reloaded into the kernel.

In [None]:
#TODO: Add %autoreload cell for imported parsing functions to be reloaded if they are changed.

[Back to top](#title)

-------------------

In [None]:
# Imports from worker/src/app.py
import logging
from datetime import datetime

# from fastapi import Depends, FastAPI
# from faststream.redis.fastapi import RedisRouter



In [None]:
# Additional functions imported from worker/src/app.py
from worker.src.app import get_storage_handler, get_chunker

In [None]:
#TODO: Do we need to set a flag to ensure other_chunker is used?

In [None]:
client = Minio("localhost:9000", access_key="minioadmin", secret_key="minioadmin", secure=False)

In [None]:
def ingest(
    file: File,
    storage_handler: ElasticsearchStorageHandler,
    chunker: FileChunker
):
    """
    1. Chunks file
    2. Puts chunks to ES
    3. Acknowledges message
    4. Puts chunk on embedder-queue
    """

    logging.info("Ingesting file: %s", file)

    chunks = chunker.chunk_file(file=file)

    logging.info("Writing %s chunks to storage for file uuid: %s", len(chunks), file.uuid)

    items = storage_handler.write_items(chunks)
    logging.info("written %s chunks to elasticsearch", len(items))

    return items



In [None]:
# Initialize a File
file = File(key='data_eval/Universal-Basic-Income-Scotland-Report.pdf', bucket='redbox-storage-dev', creator_user_uuid='123e4567-e89b-12d3-a456-426614174000')

In [None]:
# Call the ingest function
items = ingest(file, storage_handler, chunker)

In [None]:
items

In [None]:
from elasticsearch import Elasticsearch

# Initialize the Elasticsearch client
es = Elasticsearch(
                hosts=[
                    {
                        "host": "localhost",
                        "port": 9200,
                        "scheme": "http",
                    }
                ],
                basic_auth=("elastic", "redboxpass"),
            )

# Define the search query
query = {
    "query": {
        "match_all": {}
    }
}

# Execute the search query
response = es.search(index="redbox-data-chunk", body=query)

# Print the search results
for hit in response['hits']['hits']:
    print(hit['_source'])

In [None]:
from elasticsearch import Elasticsearch

# Initialize the Elasticsearch client
es = Elasticsearch(
    hosts=[
        {
            "host": "localhost",
            "port": 9200,
            "scheme": "http",
        }
    ],
    basic_auth=("elastic", "redboxpass"),
)

# Define the search query
query = {
    "query": {
        "match_all": {}
    },
    "size": 10000  # Increase this number to return more documents
}

# Execute the search query
response = es.search(index="redbox-data-chunk", body=query)

# Print the search results
for hit in response['hits']['hits']:
    print(hit['_source'])

-----

## Prompt Playground <a class="anchor" id="five-section"></a>

**Do an initial run through this notebook with the starting/default prompts BEFORE your first experiment.** This will give you baseline scores for each metric to compare your experiment results against.

Add baseline scores below

##### Baseline evaluation

**[2024-05-15] Baseline scores**
Using the deployed /chat/rag endpoint to get `actual_output` from Redbox RAG chat, we got the following baseline scores for each metric:
- Faithfulness: **#TODO: Populate after first run through**
- Answer Relevancy: **#TODO: Populate after first run through**
- Hallucination: **#TODO: Populate after first run through**
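
Once baseline scores exist, comparing an experiment against them is a per-metric subtraction. The sketch below is illustrative only: `score_deltas` is a hypothetical helper and the numbers are placeholders, not real Redbox results. As far as we understand `deepeval`, a lower hallucination score is better, so a negative delta for that metric would be an improvement.

```python
def score_deltas(baseline: dict[str, float], experiment: dict[str, float]) -> dict[str, float]:
    """Return experiment minus baseline for every metric present in both runs."""
    return {
        metric: round(experiment[metric] - baseline[metric], 4)
        for metric in baseline
        if metric in experiment
    }


# Placeholder numbers for illustration only; they are not real Redbox results.
baseline = {"faithfulness": 0.50, "answer_relevancy": 0.50, "hallucination": 0.50}
experiment = {"faithfulness": 0.75, "answer_relevancy": 0.50, "hallucination": 0.25}

for metric, delta in score_deltas(baseline, experiment).items():
    print(f"{metric}: {delta:+.2f}")
```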

After you have done your first run through the notebook, please experiment with these prompts as you see fit.

Things to experiment with:
1. `_core_redbox_prompt`
2. `CORE_REDBOX_PROMPT`
3. `_with_sources_template`
4. `WITH_SOURCES_PROMPT`
5. `_stuff_document_template`
6. `STUFF_DOCUMENT_PROMPT`
7. The LLM being used - **For now, please stick with gpt-3.5-turbo, as we establish a baseline quality**

#### Refining Question Prompts
This refining of the question is pre-retrieval

In [None]:
_chat_template = """Given the following conversation and a follow up question,
rephrase the follow up question to be a standalone question, in its original
language. include the follow up instructions in the standalone question.

Chat History:
{chat_history}
Follow Up Input: {question}
Standalone question:"""

In [None]:
CONDENSE_QUESTION_PROMPT = PromptTemplate.from_template(_chat_template)

#### RAG prompts  <a class="anchor" id="six-section"></a>
These are the prompts that will have the most effect on RAG generation.

In [None]:
_core_redbox_prompt = """You are RedBox Copilot. An AI focused on helping UK Civil Servants, Political Advisors and\
Ministers triage and summarise information from a wide variety of sources. You are impartial and\
non-partisan. You are not a replacement for human judgement, but you can help humans\
make more informed decisions. If you are asked a question you cannot answer based on your following instructions, you\
should say so. Be concise and professional in your responses. Respond in markdown format.

=== RULES ===

All responses to Tasks **MUST** be translated into the user's preferred language.\
This is so that the user can understand your responses.\
"""

In [None]:
# Check where CORE_REDBOX_PROMPT is used in the codebase
CORE_REDBOX_PROMPT = PromptTemplate.from_template(_core_redbox_prompt)

In [None]:
_with_sources_template = """Given the following extracted parts of a long document and \
a question, create a final answer with Sources at the end.  \
If you don't know the answer, just say that you don't know. Don't try to make \
up an answer.
Be concise in your response and summarise where appropriate. \
At the end of your response add a "Sources:" section with the documents you used. \
DO NOT reference the source documents in your response. Only cite at the end. \
ONLY PUT CITED DOCUMENTS IN THE "Sources:" SECTION AND NO WHERE ELSE IN YOUR RESPONSE. \
IT IS CRUCIAL that citations only happens in the "Sources:" section. \
This format should be <DocX> where X is the document UUID being cited.  \
DO NOT INCLUDE ANY DOCUMENTS IN THE "Sources:" THAT YOU DID NOT USE IN YOUR RESPONSE. \
YOU MUST CITE USING THE <DocX> FORMAT. NO OTHER FORMAT WILL BE ACCEPTED.
Example: "Sources: <DocX> <DocY> <DocZ>"

Use **bold** to highlight the most question relevant parts in your response.
If dealing dealing with lots of data return it in markdown table format.

QUESTION: {question}
=========
{summaries}
=========
FINAL ANSWER:"""

In [None]:
WITH_SOURCES_PROMPT = PromptTemplate.from_template(_core_redbox_prompt + _with_sources_template)

In [None]:
_stuff_document_template = "<Doc{parent_doc_uuid}>{page_content}</Doc{parent_doc_uuid}>"

In [None]:
STUFF_DOCUMENT_PROMPT = PromptTemplate.from_template(_stuff_document_template)
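
To see what the document template produces, it can be rendered with plain `str.format`. This sketch copies the template string from above; the UUID and page content are made up for illustration. Our understanding is that the "stuff" chain joins one such block per retrieved chunk and places the result in `{summaries}`.

```python
# Same template string as `_stuff_document_template`; the values below are invented.
stuff_document_template = "<Doc{parent_doc_uuid}>{page_content}</Doc{parent_doc_uuid}>"

rendered = stuff_document_template.format(
    parent_doc_uuid="123e4567",
    page_content="Universal Basic Income is a regular payment.",
)
print(rendered)
# <Doc123e4567>Universal Basic Income is a regular payment.</Doc123e4567>
```

Because the document UUID appears in the tag, the `<DocX>` citation format requested in `_with_sources_template` can be matched back to a source file.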

If you find changes to the prompts above improve the generation evaluation scores, please consider making a PR to update the code in main.

All these prompts are located in [chat.py](../../redbox/llm/prompts/chat.py), except `_core_redbox_prompt`, which is located in [core.py](../../redbox/llm/prompts/core.py)

In [None]:
from langchain_community.embeddings import SentenceTransformerEmbeddings
from langchain_elasticsearch import ApproxRetrievalStrategy, ElasticsearchStore

In [None]:
log.info("Loading embedding model from environment: %s", env.embedding_model)
embedding_model = SentenceTransformerEmbeddings(model_name=env.embedding_model)
log.info("Loaded embedding model from environment: %s", env.embedding_model)

In [None]:
#TODO: Model caching

In [None]:

if env.elastic.subscription_level == "basic":
    strategy = ApproxRetrievalStrategy(hybrid=False)
elif env.elastic.subscription_level in ["platinum", "enterprise"]:
    strategy = ApproxRetrievalStrategy(hybrid=True)
else:
    raise ValueError(f"Unknown Elastic subscription level {env.elastic.subscription_level}")


vector_store = ElasticsearchStore(
    es_connection=es,
    index_name="redbox-data-chunk",
    embedding=embedding_model,
    strategy=strategy,
    vector_query_field="embedding",
)

[Back to top](#title)

--------------

## Generate RAG responses and append them to evaluation dataset  <a class="anchor" id="seven-section"></a>

In [None]:

#TODO: Load required functions

#TODO: Any functions below that we need to mock?

# I would like to keep the rag_chat function unchanged from what it is in the core_api repo.

# However, the user_uuid is only used for authorisation (it is NOT used for authentication), so if too troublesome, can be removed.

**Create `ChatRequest` for evaluation dataset, used by the RAG chat function**

For the `ChatRequest` Pydantic model used by the RAG chat function, the JSON body should contain a `message_history` key with a list of chat messages.

Each chat message should match the structure defined by the `ChatMessage` model.
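
As a rough sketch of that structure, the helper below builds one request per evaluation question as a plain dictionary with `text` and `role` keys, matching the example request in this notebook. `build_chat_request` is a hypothetical name, and the default system message is only an assumption carried over from that example.

```python
def build_chat_request(question: str, system_text: str = "You are a helpful AI Assistant") -> dict:
    """Build a dict with a `message_history` list whose last message is the question."""
    return {
        "message_history": [
            {"text": system_text, "role": "system"},
            {"text": question, "role": "user"},
        ]
    }


questions = ["What is Universal Basic Income?"]
chat_requests = [build_chat_request(q) for q in questions]
print(chat_requests[0]["message_history"][-1]["text"])  # What is Universal Basic Income?
```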

In [None]:
#TODO

In [None]:
chat_request = {
                "message_history": [
                        {"text": "You are a helpful AI Assistant", "role": "system"},
                        {"text": "What is Universal Basic Income?", "role": "user"},
                ]
               }

In [None]:
chat_request

In [None]:
# This only works within the function (or FastAPI), due to attribute access: question = chat_request.message_history[-1].text
question = chat_request["message_history"][-1]['text']

**RAG chat function - generation part**

In [None]:
# === LLM setup ===
llm = ChatLiteLLM(
    model="gpt-3.5-turbo",
    streaming=True,
)

In [None]:
# def rag_chat(chat_request: ChatRequest, user_uuid: Annotated[UUID, Depends(get_user_uuid)]) -> ChatResponse:
def rag_chat(chat_request, user_uuid) -> ChatResponse:    
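    """Get an LLM response to a question, given the chat history and the user's files.

    Args:
        chat_request: dict with a `message_history` list of {"text", "role"} messages;
            the last message is treated as the question.
        user_uuid: UUID used to filter retrieved chunks by `creator_user_uuid`.

    Returns:
        ChatResponse: the generated answer (`output_text`) and its source documents.
    """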
    question = chat_request["message_history"][-1]["text"]
    previous_history = list(chat_request["message_history"][:-1])
    previous_history = ChatPromptTemplate.from_messages(
        (msg["role"], msg["text"]) for msg in previous_history
    ).format_messages()

    condense_question_chain = LLMChain(llm=llm, prompt=CONDENSE_QUESTION_PROMPT)

    standalone_question = condense_question_chain({"question": question, "chat_history": previous_history})["text"]

    docs = vector_store.as_retriever(
        search_kwargs={"filter": {"term": {"creator_user_uuid.keyword": str(user_uuid)}}}
    ).get_relevant_documents(standalone_question)

    docs_with_sources_chain = load_qa_with_sources_chain(
        llm,
        chain_type="stuff",
        prompt=WITH_SOURCES_PROMPT,
        document_prompt=STUFF_DOCUMENT_PROMPT,
        verbose=True,
    )

    result = docs_with_sources_chain(
        {
            "question": standalone_question,
            "input_documents": docs,
        },
    )

    source_documents = [
        SourceDocument(
            page_content=langchain_document.page_content,
            file_uuid=langchain_document.metadata.get("parent_doc_uuid"),
            page_numbers=langchain_document.metadata.get("page_numbers"),
        )
        for langchain_document in result.get("input_documents", [])
    ]
    return ChatResponse(output_text=result["output_text"], source_documents=source_documents)

#### Generate `actual_output` using RAG and evaluation dataset

In [None]:
#TODO: This is where we put it all together!

In [None]:
test = rag_chat(chat_request, 1234)

[Back to top](#title)

-------

#### Append `actual_output` to evaluation dataset
Process the goldens and convert them into test cases:

In [None]:
from typing import List
from deepeval.test_case import LLMTestCase
from deepeval.dataset import Golden


def convert_goldens_to_test_cases(goldens: List[Golden], user_uuid=1234) -> List[LLMTestCase]:
    test_cases = []
    for golden in goldens:
        # Build a chat request for this golden, mirroring the `chat_request` dict used above
        golden_request = {
            "message_history": [
                {"text": "You are a helpful AI Assistant", "role": "system"},
                {"text": golden.input, "role": "user"},
            ]
        }
        # Generate the actual output by passing the golden's input through rag_chat
        response = rag_chat(golden_request, user_uuid)
        test_case = LLMTestCase(
            input=golden.input,
            actual_output=response.output_text,
            expected_output=golden.expected_output,
            context=golden.context,
            # Chunks actually retrieved for this question (used by the faithfulness metric)
            retrieval_context=[doc.page_content for doc in response.source_documents],
        )
        test_cases.append(test_case)
    return test_cases

# Data preprocessing before setting the dataset test cases.
# NOTE: `dataset` must first be loaded in the "Load dataset" cell, which is currently commented out.
dataset.test_cases = convert_goldens_to_test_cases(dataset.goldens)

[Back to top](#title)

## Run generation evaluation

In [None]:
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

---------

## Promote optimised prompts into production

If you find changes to the prompts above improve the generation evaluation scores, please consider making a PR to update the code in `core_api`. Follow these steps:

1. Create a new branch off `main`
2. Make changes in the locations listed below
3. Run through the e2e RAG evaluation notebook
4. If e2e RAG evaluation metrics are improved, please make a PR!

All these prompts are located in [chat.py](../../redbox/llm/prompts/chat.py), except `_core_redbox_prompt`, which is located in [core.py](../../redbox/llm/prompts/core.py)

[Back to top](#title)

--------