# Generation Evaluation for Redbox RAG chat  <a class="anchor" id="title"></a>

## Table of Contents <a class="anchor" id="toc"></a>
* [Overview](#one-section)
* [Metrics](#two-section)
    - [Fathfulness]()
    - [Answer Relevancy]()
    - [Hallucination]()
* [Evaluation Dataset](#three-section)
* [Evaluation Workflow in this Notebook](#four-section)
* [Prompt Playground](#five-section)
    - [RAG prompts](#six-section)
* [Generate RAG responses and append them to evaluation dataset](#seven-section)

* [eight](#eight-section)


## Overview <a class="anchor" id="one-section"></a>

When it comes to optimising the generation part of our RAG system, the only thing that we can modify are the `RAG prompts` that are passed with context to the LLM. Other components certainly play into the overall generation evaluation score, such as is the retrieved context of high-quality, but the levers to change these other components are further upstream in the RAG pipeline, and evaluated in Retrieval Evaluation and e2d Evaluation notebooks. These other components are also slower to change compared to prompts, which are just natural language!

We want to avoid using the /chat/rag endpoint for quick experimentation with `RAG prompts`, as the need to rebuild the core_api docker image, start and stop container etc will really slow down development --> changing prompts is very quick to do, so we want quick evaluation of how these prompt changes. 

For this reason, the /chat/rag endpoint function is in this notebook, and prompts can be changed in a single place, followed by much quicker feedback. If your prompt experiments look good, i.e. they improve generation evalution metrics, then you can consider making these changes in the `core_api` service. Information on where to make the corresponding changesin the the `core_api` service are at the bottom of this notebook. Once you make changes in `core_api` and rebuild, these changes will be reflected in the deployed /chat/rag endpoint.

We will evaluate RAG generation using metrics described in the next section.

[Back to top](#title)

---------------

## Metrics <a class="anchor" id="two-section"></a>

Start by using 3 DeepEval metrics:
- Faithfulness
- Answer Relevancy **(what are we taking as 'input'? Raw question or refined question?)**
- Hallucination

### Faithfulness

The faithfulness metric measures the quality of your RAG pipeline's generator by evaluating whether the `actual_output` factually aligns with the contents of your `retrieval_context`. `deepeval`'s faithfulness metric is a self-explaining LLM-Eval, meaning it outputs a reason for its metric score.

##### Required Arguments
To use the `FaithfulnessMetric`, you need to provide the following arguments when creating an LLMTestCase:

- `input`
- `actual_output`
- `retrieval_context`

[Back to top](#title)

### Answer Relevancy
The answer relevancy metric measures the quality of your RAG pipeline's generator by evaluating how relevant the actual_output of your LLM application is compared to the provided `input`. `deepeval`'s answer relevancy metric is a self-explaining LLM-Eval, meaning it outputs a reason for its metric score.

##### Required Arguments
To use the AnswerRelevancyMetric, you'll have to provide the following arguments when creating an LLMTestCase:

- `input`
- `actual_output`

[Back to top](#title)

### Hallucination
The hallucination metric determines whether your LLM generates factually correct information by comparing the `actual_output` to the provided `context`.

##### Required Arguments
To use the HallucinationMetric, you'll have to provide the following arguments when creating an LLMTestCase:

- `input`
- `actual_output`
- `retrieval_context`

[Back to top](#title)

-----------------

## Run Redbox locally
We want to take advantage of the document processing part of the redbox `file` api

In [51]:
import os
from jose import jwt
from uuid import UUID
import requests
import json

In [2]:
bearer_token = jwt.encode({"user_uuid": str(UUID("aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa"))}, key="your-secret-key", algorithm="HS512")
print(bearer_token)

eyJhbGciOiJIUzUxMiIsInR5cCI6IkpXVCJ9.eyJ1c2VyX3V1aWQiOiJhYWFhYWFhYS1hYWFhLWFhYWEtYWFhYS1hYWFhYWFhYWFhYWEifQ.kwzm-8i8SveeqYqvsRUm4FiB7nd3I43aI70ImljgdudKM4xrDw9z3CUpEBRwqqh6D3ZghB2T-Lu7BlV36VR5sg


In [None]:
#TODO: Get absolute paths for the files you want to upload
/Users/andy/Documents/Projects/i-dot-ai/Test Documents/AI Safety

In [3]:
files = {'file': open('/Users/andy/Documents/Projects/i-dot-ai/Test Documents/AI Safety/The_impact_of_AI_on_UK_jobs_and_training_report.pdf', 'rb')}

### Upload all documents selected for evaluation

Set upload URL and header

In [70]:
url = 'http://127.0.0.1:5002/file/upload'

headers={
    'accept': 'application/json',
    "Authorization": f"Bearer {bearer_token}"
}

Get absoluate paths for all files to be used for evaluation.

**Please just update the directory variable below (if required), to the directory containinig all your files**

In [58]:
# Specify the directory you want to scan
directory = './evaluation_files_v1'

In [71]:
files = os.listdir(directory)

# Use os.path.join and os.path.abspath to get absolute paths
absolute_paths = [os.path.abspath(os.path.join(directory, file)) for file in files]


In [72]:
for file in absolute_paths:
    files = {'file': open(file, 'rb')}
    upload_file_response = requests.post(url, headers=headers, files=files)

    #TODO: Add some login in the loop to deal with status codes != 200
    # if upload_file_response.status_code != 200:
    #     print("Failed to upload data:", upload_file_response.status_code)

------

#### Get chunks

List files uploaded to server in current session

In [73]:
url = 'http://127.0.0.1:5002/file/'

headers={
    'accept': 'application/json',
    "Authorization": f"Bearer {bearer_token}"
}

file_list_response = requests.get(url, headers=headers)

View JSON response

In [74]:
if file_list_response.status_code == 200:
    # Parse JSON from the response
    data = file_list_response.json()
    
    # Pretty-print the JSON data
    pretty_json = json.dumps(data, indent=4)
    print(pretty_json)
else:
    print("Failed to retrieve data:", file_list_response.status_code)

[
    {
        "uuid": "2a8e50e5-265d-41c4-8bc3-b74d2b120ef5",
        "created_datetime": "2024-05-20T15:57:12.618618",
        "creator_user_uuid": "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa",
        "key": "Frontier AI Taskforce_ second progress report - GOV.UK.pdf",
        "bucket": "redbox-storage-dev",
        "model_type": "File"
    },
    {
        "uuid": "e57eed74-2b79-44ac-8b83-280dfabd2d3b",
        "created_datetime": "2024-05-20T17:50:38.723709",
        "creator_user_uuid": "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa",
        "key": "Prime Minister's speech on AI_ 26 October 2023 - GOV.UK.pdf",
        "bucket": "redbox-storage-dev",
        "model_type": "File"
    },
    {
        "uuid": "eef116c6-0198-4645-8fce-b6cfbccb72e6",
        "created_datetime": "2024-05-20T17:50:38.804081",
        "creator_user_uuid": "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa",
        "key": "The_impact_of_AI_on_UK_jobs_and_training_report.pdf",
        "bucket": "redbox-storage-dev",
        "model

Get a list of UUIDs

In [75]:
uuid_list = []

for item in data:
    if 'uuid' in item:
        uuid_list.append({'uuid': item['uuid']})

print(uuid_list)

[{'uuid': '2a8e50e5-265d-41c4-8bc3-b74d2b120ef5'}, {'uuid': 'e57eed74-2b79-44ac-8b83-280dfabd2d3b'}, {'uuid': 'eef116c6-0198-4645-8fce-b6cfbccb72e6'}, {'uuid': '057a9a74-374c-4cab-a10c-07ce8181c99e'}]


Get file status

In [77]:
status_url_list = []
for uuid in uuid_list:
    url = f"http://127.0.0.1:5002/file/{uuid['uuid']}/status"
    status_url_list.append(url)

In [78]:
#TODO: Check this code works with > 1 document
status_responses = []
for url in status_url_list:
    status_response = requests.get(url, headers=headers)
    status_responses.append(status_response)

In [79]:
#TODO check all status_code == 200
status_responses

[<Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>]

In [80]:
for status in status_responses:
    data = status.json()
    pretty_json = json.dumps(data, indent=4)
    print(pretty_json)

{
    "file_uuid": "2a8e50e5-265d-41c4-8bc3-b74d2b120ef5",
    "processing_status": "complete",
    "chunk_statuses": [
        {
            "chunk_uuid": "496b9637-48b8-48d9-acc7-ea4cac9af2b2",
            "embedded": true
        },
        {
            "chunk_uuid": "4129def6-a2a0-4333-8ac0-877d4c4dd98a",
            "embedded": true
        },
        {
            "chunk_uuid": "21a573c8-dc01-4576-b3d9-61e27f900a37",
            "embedded": true
        },
        {
            "chunk_uuid": "48d83e2e-f27a-4615-ae91-6ba950d7da18",
            "embedded": true
        },
        {
            "chunk_uuid": "678c8642-ecf7-4922-bb19-32c0667285b8",
            "embedded": true
        },
        {
            "chunk_uuid": "a416b8d8-8d06-4946-ac5b-06b7289ec6af",
            "embedded": true
        },
        {
            "chunk_uuid": "5cfd7dd6-e42e-43a8-b47a-882347d54394",
            "embedded": true
        },
        {
            "chunk_uuid": "da0e77bf-6f62-44d3-b544-4699d2e

#### Get chunks for each file

In [84]:
chunks_url_list = []
for uuid in uuid_list:
    url = f"http://127.0.0.1:5002/file/{uuid['uuid']}/chunks"
    chunks_url_list.append(url)

In [82]:
chunks_responses = []
for url in chunks_url_list:
    chunks_response = requests.get(url, headers=headers)
    chunks_responses.append(chunks_response)

In [83]:
chunks_responses

[<Response [200]>, <Response [200]>, <Response [200]>, <Response [200]>]

In [85]:
uuid_text_pairs_list = []
for chunk_response in chunks_responses:
    if chunks_response.status_code == 200:
        # Parse JSON from the response
        data = chunks_response.json()
        uuid_text_pairs = [(item["uuid"], item["text"]) for item in data]
        uuid_text_pairs_list.append(uuid_text_pairs)

------------

**NEED TO PULL IN FROM MAIN BUG FIX, AS ONLY ONE FILE CHUNKS ARE BEING CREATED!**

------------------

In [94]:
uuid_text_pairs_list[2][0]

('27fe9356-41d6-4e4f-b761-01b0e03d80f5',
 'The mental health effects of a universal basic income\n\nRegistered Charity No. 801130 (England), SC039714\n\n(Scotland). Company Registration No. 2350846.\n\nA Mental Health Foundation report This report was led by Dr Naomi Wilson and Dr Shari McDaid\n\nRecommended citation:\n\nWilson N. and McDaid S. (2021) The Mental Health Effects of a Universal Basic Income.\n\nGlasgow: The Mental Health Foundation.\n\n@MHFScot\n\n@mentalhealthfoundation')

In [49]:
uuid_text_pairs[0]

('99a95b52-482c-4364-95f4-0b7048aad414',
 'Furthermore, we are excited to welcome Rumman Chowdhury, who will be working with the Taskforce to develop its work on safety infrastructure, as well as its work on evaluating societal impacts from AI. Rumman is the CEO and co- founder of Humane Intelligence, and led efforts for the largest generative AI public red-teaming event at DEFCON this year. She is also a Responsible AI fellow at the Harvard Berkman Klein Center, and previously led the META (ML Ethics, Transparency, and Accountability)')

In [47]:
uuid_text_pairs[0][1]

'Furthermore, we are excited to welcome Rumman Chowdhury, who will be working with the Taskforce to develop its work on safety infrastructure, as well as its work on evaluating societal impacts from AI. Rumman is the CEO and co- founder of Humane Intelligence, and led efforts for the largest generative AI public red-teaming event at DEFCON this year. She is also a Responsible AI fellow at the Harvard Berkman Klein Center, and previously led the META (ML Ethics, Transparency, and Accountability)'

Something to do with version of elastic search container not pinned? Pulling in 8.12.0 but should be using 8.11.0

Follow the steps below to get the chunks for the evaluation document(s):

1. Run app locally WITHOUT detached mode: `docker compose up -d elasticsearch kibana worker minio redis core-api db`

2. View Swagger UI for /file endpoint at: `http://127.0.0.1:5002/file/docs`

3. Authorise yourself. Top right of Swagger docs, click on Authorise button. Paste in the `bearer_token` generated in the call above.

4. Upload documents selected for evaluation

5. Take a note of the uuid(s), e.g. 7b550232-35c4-48fd-8d7a-ba364c1378c4 (this will change each time you run locally)

Chunking happens very quickly. Embedding takes more time, but will give you a boolean flag on complete.

6. From the Swagger UI, use the `file/{uuid}/status` endpoint to check status. Use the `uuid`s noted in step 4

7. From the Swagger UI use the `{file_uuid}/chunks` endpoint to get the chunks required for the next step of evaluation. Use the `uuid`s noted in step 4 to get chunks required for evaluation.

The complete output can be downloaded in JSON format from the Swagger UI docs page

8. Move downloaded response into `notebooks/evaluation/data_eval` folder

## Notebook Setup <a class="anchor" id="five-section"></a>

Some basic setup for RAG chat function to work during experimentation

In [16]:
from langchain.prompts.prompt import PromptTemplate

**Import Pydantic chat models, used by the RAG chat function**

In [15]:
from redbox.models.chat import ChatMessage, ChatRequest, ChatResponse, SourceDocument

**Mock HTTP Authorization Credentials, used by the RAG chat function**

In [None]:
# from core_api.src.auth import get_user_uuid
# from fastapi.security import HTTPAuthorizationCredentials

In [None]:
# Mock HTTP Authorization Credentials
# credentials = Mock(spec=HTTPAuthorizationCredentials)
# credentials.credentials = "mock_token"

**RAG chat function imports**

In [14]:
from langchain.chains.llm import LLMChain
from langchain.chains.qa_with_sources import load_qa_with_sources_chain
from langchain_community.chat_models import ChatLiteLLM
from langchain_core.prompts import ChatPromptTemplate

[Back to top](#title)

---------------

## Evaluation Dataset <a class="anchor" id="three-section"></a>

### Load dataset

Currently having issues loading test cases from JSON. Check formating and Discord for any known issues

In [None]:
# from deepeval.dataset import EvaluationDataset

# dataset = EvaluationDataset()
# dataset.add_test_cases_from_json_file(
#     # file_path is the absolute path to you .json file
#     file_path="/Users/andy/tw/gen-ai/redbox-copilot/notebooks/evaluation/data/synthetic_data/ragas_synthetic_data_10.json",
#     # file_path="./data/synthetic_data/ragas_synthetic_data2.json",
#     input_key_name="input",
#     actual_output_key_name="actual_output",
#     expected_output_key_name="expected_output",
#     context_key_name="context",
#     # retrieval_context_key_name="retrieval_context",
# )

Import test cases from CSV working fine

In [9]:
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.add_test_cases_from_csv_file(
    # file_path is the absolute path to you .csv file
    file_path="/Users/andy/tw/gen-ai/redbox-copilot/notebooks/evaluation/data/synthetic_data/ragas_synthetic_data_10.csv",
    input_col_name="input",
    actual_output_col_name="actual_output",
    expected_output_col_name="expected_output",
    context_col_name="context",
    context_col_delimiter= ";",
    retrieval_context_col_name="retrieval_context",
    retrieval_context_col_delimiter= ";"
)

[Back to top](#title)

------

## Evaluation Workflow in this Notebook <a class="anchor" id="four-section"></a>

Follow these steps to run an experiment:
1. Make experimental changes to [`RAG prompts`]() - these will be used by the /chat/rag function
2. (Optional) Make experimental changes to the [/chat/rag function]() 
3. Pass the evaluation dataset through the /chat/rag function to general `actual_output` and append these to the evaluation dataset
4. Run evaluations on dataset to calcuate generation evaluation metrics

[Back to top](#title)

-------------

## Prompt Playground <a class="anchor" id="five-section"></a>

**Do an initial run through this notebook with the starting/default prompts BEFORE your first experiment.** This will give you baseline scores for each metric to compare your experiment results against.

Add baseline scores below

##### Baseline evaluation

**[2024-05-15] Baseline scores**
Using the deployed /chat/rag enpoint to get `actual_output` from Redbox RAG chat, we got the following baseline scores for each metric:
- Faithfulness: **#TODO: Populate after first run through**
- Answer Relevancy: **#TODO: Populate after first run through**
- Hallucination: **#TODO: Populate after first run through**

After you have done your first run through the notebook, please experiment with these prompts as you see fit.

Things to experiment with:
1. `_core_redbox_prompt`
2. `CORE_REDBOX_PROMPT`
3. `_with_sources_template`
4. `WITH_SOURCES_PROMPT`
5. `_stuff_document_template`
6. `STUFF_DOCUMENT_PROMPT`
7. The LLM being used - **For now, please stick with gpt-3.5-turbo, as we establish a baseline quality**

#### RAG prompts  <a class="anchor" id="six-section"></a>
These are the prompts that will have most effect on RAG generation

In [12]:
_core_redbox_prompt = """You are RedBox Copilot. An AI focused on helping UK Civil Servants, Political Advisors and\
Ministers triage and summarise information from a wide variety of sources. You are impartial and\
non-partisan. You are not a replacement for human judgement, but you can help humans\
make more informed decisions. If you are asked a question you cannot answer based on your following instructions, you\
should say so. Be concise and professional in your responses. Respond in markdown format.

=== RULES ===

All responses to Tasks **MUST** be translated into the user's preferred language.\
This is so that the user can understand your responses.\
"""

In [17]:
# Check where CORE_REDBOX_PROMPT is used in the codebase
CORE_REDBOX_PROMPT = PromptTemplate.from_template(_core_redbox_prompt)

In [18]:
_with_sources_template = """Given the following extracted parts of a long document and \
a question, create a final answer with Sources at the end.  \
If you don't know the answer, just say that you don't know. Don't try to make \
up an answer.
Be concise in your response and summarise where appropriate. \
At the end of your response add a "Sources:" section with the documents you used. \
DO NOT reference the source documents in your response. Only cite at the end. \
ONLY PUT CITED DOCUMENTS IN THE "Sources:" SECTION AND NO WHERE ELSE IN YOUR RESPONSE. \
IT IS CRUCIAL that citations only happens in the "Sources:" section. \
This format should be <DocX> where X is the document UUID being cited.  \
DO NOT INCLUDE ANY DOCUMENTS IN THE "Sources:" THAT YOU DID NOT USE IN YOUR RESPONSE. \
YOU MUST CITE USING THE <DocX> FORMAT. NO OTHER FORMAT WILL BE ACCEPTED.
Example: "Sources: <DocX> <DocY> <DocZ>"

Use **bold** to highlight the most question relevant parts in your response.
If dealing dealing with lots of data return it in markdown table format.

QUESTION: {question}
=========
{summaries}
=========
FINAL ANSWER:"""

In [19]:
WITH_SOURCES_PROMPT = PromptTemplate.from_template(_core_redbox_prompt + _with_sources_template)

In [20]:
_stuff_document_template = "<Doc{parent_doc_uuid}>{page_content}</Doc{parent_doc_uuid}>"

In [21]:
STUFF_DOCUMENT_PROMPT = PromptTemplate.from_template(_stuff_document_template)

If you find changes to the prompts above improve the generation evaluation scores, please consider making a PR to update the code in main.

All these prompts are locations in [chat.py](../../redbox/llm/prompts/chat.py), except `_core_redbox_prompt` which is located in [core.py](../../redbox/llm/prompts/core.py)

[Back to top](#title)

--------------

## Generate RAG responses and append them to evaluation dataset  <a class="anchor" id="seven-section"></a>

#TODO: Start with existing Core API to get baseline metrics

In [None]:

#TODO: Load required functions

#TODO: Any functions below that we need to mock?

# I would like to keep the rag_chat function unchanged from what it is in the core_api repo.

# However, the user_uuid is only used for authorisation (it is NOT used for authentication), so if too troublesome, can be removed.

**Create `ChatRequest` for evaluation dataset, used by the RAG chat function**

For the `ChatRequest` Pydantic model used by the RAG chat function, the JSON body should contain a `message_history` key with a list of chat messages.

Each chat message should match the structure defined by the `ChatMessage` model.

In [None]:
#TODO

In [None]:
chat_request = {
                "message_history": [
                        {"text": "You are a helpful AI Assistant", "role": "system"},
                        {"text": "What is AI?", "role": "user"},
                ]
               }

In [None]:
chat_request

In [None]:
# This only works within the function (or FastAPI), due to attribute access: question = chat_request.message_history[-1].text
question = chat_request["message_history"][-1]['text']

**RAG chat function - generation part**

In [22]:
# === LLM setup ===
llm = ChatLiteLLM(
    model="gpt-3.5-turbo",
    streaming=True,
)

In [23]:
def rag_chat(chat_request: ChatRequest, user_uuid: Annotated[UUID, Depends(get_user_uuid)]) -> ChatResponse:
    """Get a LLM response to a question history and file

    Args:


    Returns:
        StreamingResponse: a stream of the chain response
    """
    question = chat_request.message_history[-1].text
    previous_history = list(chat_request.message_history[:-1])
    previous_history = ChatPromptTemplate.from_messages(
        (msg.role, msg.text) for msg in previous_history
    ).format_messages()

    condense_question_chain = LLMChain(llm=llm, prompt=CONDENSE_QUESTION_PROMPT)

    standalone_question = condense_question_chain({"question": question, "chat_history": previous_history})["text"]

    docs = vector_store.as_retriever(
        search_kwargs={"filter": {"term": {"creator_user_uuid.keyword": str(user_uuid)}}}
    ).get_relevant_documents(standalone_question)

    docs_with_sources_chain = load_qa_with_sources_chain(
        llm,
        chain_type="stuff",
        prompt=WITH_SOURCES_PROMPT,
        document_prompt=STUFF_DOCUMENT_PROMPT,
        verbose=True,
    )

    result = docs_with_sources_chain(
        {
            "question": standalone_question,
            "input_documents": docs,
        },
    )

    source_documents = [
        SourceDocument(
            page_content=langchain_document.page_content,
            file_uuid=langchain_document.metadata.get("parent_doc_uuid"),
            page_numbers=langchain_document.metadata.get("page_numbers"),
        )
        for langchain_document in result.get("input_documents", [])
    ]
    return ChatResponse(output_text=result["output_text"], source_documents=source_documents)

NameError: name 'Annotated' is not defined

#### Generate `actual_output` using RAG and evaluation dataset

In [None]:
#TODO: This is where we put it all together!

[Back to top](#title)

-------

#### Append `actual_output` to evaluation dataset
Process the goldens and convert them into test cases:

In [None]:
# A hypothetical LLM application example
from chatbot import query
from typing import List
from deepeval.test_case import LLMTestCase
from deepeval.dataset import Golden
...

def convert_goldens_to_test_cases(goldens: List[Golden]) -> List[LLMTestCase]:
    test_cases = []
    for golden in goldens:
        test_case = LLMTestCase(
            input=golden.input,
            # Generate actual output using the 'input' and 'additional_metadata'
            actual_output = rag_chat(chat_request: ChatRequest, user_uuid=1234)
            actual_output = rag_chat
            actual_output=query(golden.input, golden.additional_metadata),
            expected_output=golden.expected_output,
            context=golden.context,
        )
        test_cases.append(test_case)
    return test_cases

# Data preprocessing before setting the dataset test cases
dataset.test_cases = convert_goldens_to_test_cases(dataset.goldens)

[Back to top](#title)

## Run generation evaluation

In [None]:
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

---------

## Promote optimised prompts into production

If you find changes to the prompts above improve the generation evaluation scores, please consider making a PR to update the code in `core_api`. Follow these steps:

1. Create a new branch off `main`
2. Make changes in the locations listed below
3. Run through the e2e RAG evaluation notebook
4. If e2e RAG evaluation metrics are improved, please make a PR!

All these prompts are locations in [chat.py](../../redbox/llm/prompts/chat.py), except `_core_redbox_prompt` which is located in [core.py](../../redbox/llm/prompts/core.py)

[Back to top](#title)

--------