# Evaluation for Redbox RAG chat  <a class="anchor" id="title"></a>

-----------

**Evaluate Redbox RAG chat on one stable, numbered version of these data**

-----------------------

## Table of Contents <a class="anchor" id="toc"></a>
* [Overview](#overview)
* [Metrics](#metrics)
    - [Contextual Precision]()
    - [Contextual Recall]()
    - [Contextual Relevancy]()
    - [Faithfulness]()
    - [Answer Relevancy]()
    - [Hallucination]()
    
**CODE STARTS HERE**
* [Set version of the evaluation dataset](#setversion)
* [Imports](#imports)
* [Run Redbox Locally](#run-redbox)
* [Get files that correspond to the version of evaluation dataset](#files)
* [Load Evaluation Dataset into test cases](#load-test-cases)

* [Generate `actual_output` using RAG and evaluation dataset](#evaluate)
    - [Retrieval Evaluation Metrics]()
    - [Generation Evaluation Metrics]()
* [Analyse evaluation results](#analysis)

------------

## Overview <a class="anchor" id="overview"></a>

Redbox RAG chat is made up of many components that work together to form the final RAG pipeline. Each component can be optimised to improve the overall performance of the RAG pipeline on Redbox tasks. To track whether a change improves or degrades Redbox performance, we need an evaluation framework. The overall RAG pipeline can be broken down into two main parts:

1. Retrieval - searching and returning the most relevant documents to answer a user question
2. Generation - the output of the LLM after considering the retrieved documents, any prompts provided and the user question

This notebook tests both the retrieval and generation sides of the RAG pipeline with the `DeepEval` framework, using metrics specific to each side.


For consistency across the team, it is important to evaluate Redbox RAG chat on one stable, numbered version of these data.


[Back to top](#title)


-------

## Metrics <a class="anchor" id="metrics"></a>

Retrieval metrics
- Contextual Precision
- Contextual Recall
- Contextual Relevancy

Generation metrics
- Faithfulness
- Answer Relevancy
- Hallucination


### Contextual Precision

The contextual precision metric measures your RAG pipeline's retriever by evaluating whether nodes in your `retrieval_context` that are relevant to the given `input` are ranked higher than irrelevant ones.
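
As a rough illustration of the ranking idea, the sketch below computes a rank-weighted precision from a list of relevance verdicts. It assumes the weighted cumulative precision formula described in the DeepEval documentation. In DeepEval the verdicts come from an LLM judge; here they are supplied by hand, and `rank_weighted_precision` is a hypothetical helper, not part of the library.

```python
def rank_weighted_precision(relevance: list[bool]) -> float:
    """Score a ranked list of relevance verdicts (True = relevant), best rank first."""
    n_relevant = sum(relevance)
    if n_relevant == 0:
        return 0.0  # nothing relevant was retrieved

    total = 0.0
    hits = 0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / k  # precision at rank k, counted only at relevant ranks
    return total / n_relevant


# The same single relevant node scores higher when it is ranked first.
print(rank_weighted_precision([True, False, False]))
print(rank_weighted_precision([False, False, True]))
```

A retrieval context with no relevant nodes scores 0, which is why a test case can fail this metric even when some of the retrieved text looks related to the topic.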

### Contextual Recall

The contextual recall metric measures the quality of your RAG pipeline's retriever by evaluating the extent to which the `retrieval_context` aligns with the `expected_output`.

### Contextual Relevancy

The contextual relevancy metric measures the quality of your RAG pipeline's retriever by evaluating the overall relevance of the information presented in your `retrieval_context` for a given `input`.
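
Contextual recall and contextual relevancy can both be read as simple proportions once each statement has been judged. A minimal sketch, assuming the judged verdicts are already available as booleans (in DeepEval an LLM produces them; `proportion_true` and the example verdicts are hypothetical):

```python
def proportion_true(verdicts: list[bool]) -> float:
    """Fraction of verdicts that are True; 0.0 for an empty list."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0


# Contextual recall: one verdict per statement in `expected_output`,
# True if the statement can be attributed to the `retrieval_context`.
recall = proportion_true([True, False, False])

# Contextual relevancy: one verdict per statement in `retrieval_context`,
# True if the statement is relevant to the `input`.
relevancy = proportion_true([True, True, True, False])

print(f"recall={recall:.2f}, relevancy={relevancy:.2f}")
```

Recall is judged against the `expected_output`, so it depends on the quality of the evaluation dataset; relevancy is judged against the `input` only.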

### Faithfulness

The faithfulness metric measures the quality of your RAG pipeline's generator by evaluating whether the `actual_output` factually aligns with the contents of your `retrieval_context`. `deepeval`'s faithfulness metric is a self-explaining LLM-Eval, meaning it outputs a reason for its metric score.

##### Required Arguments
To use the `FaithfulnessMetric`, you need to provide the following arguments when creating an `LLMTestCase`:

- `input`
- `actual_output`
- `retrieval_context`

[Back to top](#title)

### Answer Relevancy
The answer relevancy metric measures the quality of your RAG pipeline's generator by evaluating how relevant the `actual_output` of your LLM application is compared to the provided `input`. `deepeval`'s answer relevancy metric is a self-explaining LLM-Eval, meaning it outputs a reason for its metric score.

##### Required Arguments
To use the `AnswerRelevancyMetric`, you need to provide the following arguments when creating an `LLMTestCase`:

- `input`
- `actual_output`

[Back to top](#title)

### Hallucination
The hallucination metric determines whether your LLM generates factually correct information by comparing the `actual_output` to the provided `context`.

##### Required Arguments
To use the `HallucinationMetric`, you need to provide the following arguments when creating an `LLMTestCase`:

- `input`
- `actual_output`
- `context`
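
Before building test cases, it can help to check that each row carries the fields a metric needs. The sketch below is one possible check, not part of DeepEval: `REQUIRED_FIELDS` and `missing_fields` are hypothetical names, and the hallucination entry follows the description above, which compares `actual_output` against `context`.

```python
import math

# Fields each generation metric needs, as listed in this section.
REQUIRED_FIELDS = {
    "faithfulness": ("input", "actual_output", "retrieval_context"),
    "answer_relevancy": ("input", "actual_output"),
    "hallucination": ("input", "actual_output", "context"),
}


def missing_fields(row: dict, metric: str) -> list[str]:
    """Return the required fields that are absent, None, NaN or empty in `row`."""

    def is_blank(value) -> bool:
        if value is None:
            return True
        if isinstance(value, float) and math.isnan(value):
            return True
        return value == "" or value == []

    return [name for name in REQUIRED_FIELDS[metric] if is_blank(row.get(name))]


row = {"input": "What is Redbox?", "actual_output": "A RAG chat tool."}
print(missing_fields(row, "answer_relevancy"))  # []
print(missing_fields(row, "faithfulness"))      # ['retrieval_context']
```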

[Back to top](#title)

-------------------

**Evaluate Redbox RAG chat on one stable, numbered version of these data**

**Set the version of the evaluation dataset you are using to evaluate Redbox in the cell below**   <a class="anchor" id="setversion"></a>

In [67]:
DATA_VERSION = "0.1.0"

Run the cell below to set up the required folder structure (it will not overwrite folders and files if they already exist)

In [68]:
from pathlib import Path

ROOT = Path.cwd().parents[1]
EVALUATION_DIR = ROOT / "notebooks/evaluation"

V_ROOT = EVALUATION_DIR / f"data/{DATA_VERSION}"
V_RAW = V_ROOT / "raw"
V_SYNTHETIC = V_ROOT / "synthetic"
V_CHUNKS = V_ROOT / "chunks"
V_RESULTS = V_ROOT / "results"

V_ROOT.mkdir(parents=True, exist_ok=True)
V_RAW.mkdir(parents=True, exist_ok=True)
V_SYNTHETIC.mkdir(parents=True, exist_ok=True)
V_CHUNKS.mkdir(parents=True, exist_ok=True)
V_RESULTS.mkdir(parents=True, exist_ok=True)

To save on API costs, we only need to generate a particular version of the evaluation dataset once. If you are using a previously generated evaluation dataset, please download it from the shared team location (Google Drive).
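
A small guard can make the "generate once" rule harder to break by accident. This is a sketch only: `find_cached_dataset` is a hypothetical helper, and the default filename is assumed to be the `ragas_synthetic_data.csv` file that this notebook reads later.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_cached_dataset(
    synthetic_dir: Path, filename: str = "ragas_synthetic_data.csv"
) -> Optional[Path]:
    """Return the path of an existing evaluation dataset, or None if it is missing."""
    candidate = Path(synthetic_dir) / filename
    return candidate if candidate.is_file() else None


# Demo on an empty temporary folder: nothing has been generated or downloaded yet.
with tempfile.TemporaryDirectory() as tmp:
    print(find_cached_dataset(Path(tmp)))  # None
```

In this notebook the call would be `find_cached_dataset(V_SYNTHETIC)`; a `None` result means the dataset still needs to be downloaded from the shared location.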

---------

#### Imports <a id="imports"></a>

In [69]:
# Add autoreload
%reload_ext autoreload
%autoreload 2

In [70]:
from jose import jwt
from uuid import UUID
import requests
import json
import pandas as pd

In [71]:
from dotenv import find_dotenv, load_dotenv
_ = load_dotenv(find_dotenv())

[Back to top](#title)

-----------------

## Start Redbox Running locally <a id="run-redbox"></a>

Start a Docker runtime, for example Docker Desktop. If you are using Colima instead, run the following terminal command:
```bash
colima start --memory 8
``` 

---------------------

#### First-time setup

First time users need to do the following

```bash
poetry install
```

Ensure your `.env` file contains your OpenAI API key and the following settings:

```bash
# === Object Storage ===

MINIO_HOST=minio
MINIO_PORT=9000
MINIO_ACCESS_KEY=minioadmin
MINIO_SECRET_KEY=minioadmin
AWS_ACCESS_KEY=minioadmin
AWS_SECRET_KEY=minioadmin

AWS_REGION=eu-west-2

# minio or s3
OBJECT_STORE=minio
BUCKET_NAME=redbox-storage-dev
```

Build redbox docker images (this takes several minutes)

```bash
docker compose build
```

------

#### Build containers

If changes are made to the app, e.g. changes pulled in from main, you may need to rebuild the Docker images

```bash
docker compose build --no-cache
```

#### Run Redbox locally

**Every time you start Redbox for evaluation (no Django frontend required), please run the following command**

```bash
make eval_backend
```

The above command will bring up everything you need for the backend (`core-api`, `worker`, `minio`, `elasticsearch` and `redis`), then create the MinIO bucket needed to store raw files.

[Back to top](#title)

----------

## Generate `actual_output` using RAG and evaluation dataset

#### First need to upload files that we are going to 'RAG with'

**Use the printed out bearer token below to Authorize if you ever want to use the Swagger UI docs**

In [7]:
bearer_token = jwt.encode({"user_uuid": str(UUID("aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa"))}, key="your-secret-key", algorithm="HS512")
print(bearer_token)

eyJhbGciOiJIUzUxMiIsInR5cCI6IkpXVCJ9.eyJ1c2VyX3V1aWQiOiJhYWFhYWFhYS1hYWFhLWFhYWEtYWFhYS1hYWFhYWFhYWFhYWEifQ.kwzm-8i8SveeqYqvsRUm4FiB7nd3I43aI70ImljgdudKM4xrDw9z3CUpEBRwqqh6D3ZghB2T-Lu7BlV36VR5sg
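
To sanity-check a token before pasting it into the Swagger UI, its payload can be inspected with the standard library alone. The sketch below decodes the middle segment without verifying the signature, so it is only a debugging aid. `peek_jwt_payload` is a hypothetical helper, and the demo token is hand-built rather than a real Redbox token.

```python
import base64
import json


def peek_jwt_payload(token: str) -> dict:
    """Decode the payload (middle) segment of a JWT WITHOUT verifying its signature."""
    payload_segment = token.split(".")[1]
    padding = "=" * (-len(payload_segment) % 4)  # base64url segments drop '=' padding
    return json.loads(base64.urlsafe_b64decode(payload_segment + padding))


# Build a token-shaped string for the demo; header and signature are placeholders.
demo_payload = (
    base64.urlsafe_b64encode(json.dumps({"user_uuid": "aaaa"}).encode())
    .rstrip(b"=")
    .decode()
)
demo_token = f"header.{demo_payload}.signature"
print(peek_jwt_payload(demo_token))  # {'user_uuid': 'aaaa'}
```

Calling `peek_jwt_payload(bearer_token)` in this notebook should show the `user_uuid` that was encoded above.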


#### Get files that correspond to the version of evaluation dataset  <a class="anchor" id="files"></a>

Copy all the files that match your `DATA_VERSION` into `notebooks/evaluation/data/{DATA_VERSION}/raw/`. These files are on the shared Google Drive, in the location for the corresponding version number.

**It is really important to use the same files that were used to generate this particular version of the evaluation dataset. A mismatch between the two will result in inaccurate evaluation metrics**

**Only if you haven't uploaded files already** uncomment and run cell below

In [None]:
# url = 'http://127.0.0.1:5002/file/upload'

# headers = {
#     'accept': 'application/json',
#     "Authorization": f"Bearer {bearer_token}"
# }
# for file in V_RAW.glob("*.*"):
#     # Open each file in a context manager so the handle is closed after upload
#     with open(file, 'rb') as f:
#         upload_file_response = requests.post(url, headers=headers, files={'file': f})

#     # TODO: Add some logic in the loop to deal with status codes != 200
#     # if upload_file_response.status_code != 200:
#     #     print("Failed to upload data:", upload_file_response.status_code)
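
One possible way to handle non-200 responses during upload is to collect the failures and report them at the end. The sketch below keeps the HTTP call behind a `post` callable so it can be tried without a running backend. `upload_files` and `fake_post` are hypothetical, and treating 200 as the only success code is an assumption carried over from the check in the cell above.

```python
from pathlib import Path
from typing import Callable, Iterable


def upload_files(paths: Iterable[Path], post: Callable[[Path], int]) -> dict[str, int]:
    """Upload each file via `post` and return {filename: status_code} for failures."""
    failures = {}
    for path in paths:
        status_code = post(path)
        if status_code != 200:
            failures[path.name] = status_code
    return failures


def fake_post(path: Path) -> int:
    """Stand-in for the real HTTP call: pretend that .docx uploads are rejected."""
    return 422 if path.suffix == ".docx" else 200


print(upload_files([Path("report.pdf"), Path("notes.docx")], fake_post))
# {'notes.docx': 422}
```

In this notebook, `post` could open the file and return `requests.post(url, headers=headers, files={"file": f}).status_code`, with `paths` set to `V_RAW.glob("*.*")`.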

List files uploaded to server & view JSON response

In [73]:
url = 'http://127.0.0.1:5002/file/'

headers={
    'accept': 'application/json',
    "Authorization": f"Bearer {bearer_token}"
}

file_list_response = requests.get(url, headers=headers)

if file_list_response.status_code == 200:
    # Parse JSON from the response
    data = file_list_response.json()
    
    # Pretty-print the JSON data
    pretty_json = json.dumps(data, indent=4)
    print(pretty_json)
else:
    print("Failed to retrieve data:", file_list_response.status_code)

[
    {
        "uuid": "f52a6c97-c234-40e5-a9af-94daab9035c1",
        "created_datetime": "2024-05-21T11:54:52.633856",
        "creator_user_uuid": "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa",
        "key": "Frontier AI Taskforce_ second progress report - GOV.UK.pdf",
        "bucket": "redbox-storage-dev",
        "model_type": "File"
    },
    {
        "uuid": "6bb3be69-5b5b-4ad5-9623-0eb783d3502a",
        "created_datetime": "2024-05-21T11:54:52.707323",
        "creator_user_uuid": "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa",
        "key": "Prime Minister's speech on AI_ 26 October 2023 - GOV.UK.pdf",
        "bucket": "redbox-storage-dev",
        "model_type": "File"
    },
    {
        "uuid": "961985cf-c5a1-4eeb-b472-25a06c8ef5dd",
        "created_datetime": "2024-05-21T11:54:52.778875",
        "creator_user_uuid": "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa",
        "key": "The_impact_of_AI_on_UK_jobs_and_training_report.pdf",
        "bucket": "redbox-storage-dev",
        "model

Get file status

In [74]:
from redbox.models import FileStatus

def pretty_upload_status(file_uuid: UUID, bearer_token: str) -> str:
    headers={
        'accept': 'application/json',
        "Authorization": f"Bearer {bearer_token}"
    }
    status = FileStatus(**requests.get(f"http://127.0.0.1:5002/file/{file_uuid}/status", headers=headers).json())

    status_title = status.processing_status.title()
    n_chunks = 0
    if status.chunk_statuses is not None:
        n_chunks = len(status.chunk_statuses)

    if status.processing_status == "embedding" and n_chunks > 0 and status.chunk_statuses is not None:
        n_chunks_embedded = len([chunk for chunk in status.chunk_statuses if chunk.embedded])
        return f"{status_title} ({n_chunks_embedded / n_chunks:.0%})"
    else:
        return status_title

statuses = [pretty_upload_status(file["uuid"], bearer_token) for file in file_list_response.json()]
statuses

['Complete', 'Complete', 'Complete', 'Complete']

------------

**Please ensure all embeddings have been completed before proceeding!**

Keep calm and go for a tea break!
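
Rather than re-running the status cell by hand, the check can be polled. A minimal sketch, assuming that a finished file reports the status string `Complete` as in the output above; `wait_for_statuses`, the polling interval and the timeout are hypothetical choices.

```python
import time
from typing import Callable


def wait_for_statuses(
    fetch_statuses: Callable[[], list[str]],
    done: str = "Complete",
    poll_seconds: float = 10.0,
    timeout_seconds: float = 1800.0,
) -> list[str]:
    """Poll `fetch_statuses` until every status equals `done`, or raise on timeout."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        statuses = fetch_statuses()
        if statuses and all(status == done for status in statuses):
            return statuses
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Files still processing after {timeout_seconds}s: {statuses}")
        time.sleep(poll_seconds)


# Demo with canned responses instead of the real API.
fake_responses = iter([["Embedding (50%)", "Complete"], ["Complete", "Complete"]])
print(wait_for_statuses(lambda: next(fake_responses), poll_seconds=0))
# ['Complete', 'Complete']
```

In this notebook, `fetch_statuses` could be a lambda that calls `pretty_upload_status` for every file returned by the file list endpoint.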

--------------

#### Generate `actual_output` & `retrieval_context`

In [75]:
df = pd.read_csv(f'{V_SYNTHETIC}/ragas_synthetic_data.csv')
inputs = df['input'].tolist()

##### Using a `rag_chat()` function

We can conceptualise RAG as having four mechanisms we might tune:

* Chunking
* Embedding
* Retriever
* Prompts

The below `rag_chat()` function replicates the internal logic of the RAG endpoint. By editing and using it here, you can quickly iterate and test the retriever and prompt mechanisms using your stable, versioned data, giving sharable, reproducible results.

As long as `rag_chat()` takes a question (and history) and produces an answer, it's a testable process that could be used in Redbox. Everything within the function is yours to play with -- prompts, retriever, everything.

In [21]:
from redbox.models import Settings

from langchain_community.embeddings import SentenceTransformerEmbeddings
from langchain_elasticsearch import ApproxRetrievalStrategy, ElasticsearchStore
from langchain_community.chat_models import ChatLiteLLM
from langchain.chains.qa_with_sources import load_qa_with_sources_chain
from langchain.chains.llm import LLMChain
from langchain.prompts.prompt import PromptTemplate
from langchain.globals import set_verbose

set_verbose(False)

# Core variables

ENV = Settings()
ENV.elastic.host = "localhost"
LLM = ChatLiteLLM(
    model="gpt-3.5-turbo",
    streaming=True,
)
EMBEDDING_MODEL = SentenceTransformerEmbeddings(model_name=ENV.embedding_model)
VECTOR_STORE = ElasticsearchStore(
    es_connection=ENV.elasticsearch_client(),
    index_name="redbox-data-chunk",
    embedding=EMBEDDING_MODEL,
    strategy=ApproxRetrievalStrategy(hybrid=False),
    vector_query_field="embedding",
)
RETRIEVER = VECTOR_STORE.as_retriever(
    search_kwargs={
        "filter": {
            "term": {
                "creator_user_uuid": str(UUID("aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa"))
            }
        }
    }
)

# Prompts

WITH_SOURCES_PROMPT = """You are RedBox Copilot. An AI focused on helping UK Civil Servants, Political Advisors and\
Ministers triage and summarise information from a wide variety of sources. You are impartial and\
non-partisan. You are not a replacement for human judgement, but you can help humans\
make more informed decisions. If you are asked a question you cannot answer based on your following instructions, you\
should say so. Be concise and professional in your responses. Respond in markdown format.

=== RULES ===

All responses to Tasks **MUST** be translated into the user's preferred language.\
This is so that the user can understand your responses.\
Given the following extracted parts of a long document and \
a question, create a final answer with Sources at the end.  \
If you don't know the answer, just say that you don't know. Don't try to make \
up an answer.
Be concise in your response and summarise where appropriate. \
At the end of your response add a "Sources:" section with the documents you used. \
DO NOT reference the source documents in your response. Only cite at the end. \
ONLY PUT CITED DOCUMENTS IN THE "Sources:" SECTION AND NO WHERE ELSE IN YOUR RESPONSE. \
IT IS CRUCIAL that citations only happens in the "Sources:" section. \
This format should be <DocX> where X is the document UUID being cited.  \
DO NOT INCLUDE ANY DOCUMENTS IN THE "Sources:" THAT YOU DID NOT USE IN YOUR RESPONSE. \
YOU MUST CITE USING THE <DocX> FORMAT. NO OTHER FORMAT WILL BE ACCEPTED.
Example: "Sources: <DocX> <DocY> <DocZ>"

Use **bold** to highlight the most question relevant parts in your response.
If dealing dealing with lots of data return it in markdown table format.

QUESTION: {question}
=========
{summaries}
=========
FINAL ANSWER:"""

STUFF_DOCUMENT_PROMPT = "<Doc{parent_doc_uuid}>{page_content}</Doc{parent_doc_uuid}>"

CONDENSE_QUESTION_PROMPT = """Given the following conversation and a follow up question,
rephrase the follow up question to be a standalone question, in its original
language. include the follow up instructions in the standalone question.

Chat History:
{chat_history}
Follow Up Input: {question}
Standalone question:"""

# RAG function

def rag_chat(question: str, previous_history: list[tuple[str, str]] | None = None) -> dict:
    docs_with_sources_chain = load_qa_with_sources_chain(
        LLM,
        chain_type="stuff",
        prompt=PromptTemplate.from_template(WITH_SOURCES_PROMPT),
        document_prompt=PromptTemplate.from_template(STUFF_DOCUMENT_PROMPT),
        verbose=True,
    )

    condense_question_chain = LLMChain(
        llm=LLM, 
        prompt=PromptTemplate.from_template(CONDENSE_QUESTION_PROMPT),
        verbose=False
    )

    standalone_question = condense_question_chain(
        {"question": question, "chat_history": previous_history}
    )["text"]

    docs = RETRIEVER.get_relevant_documents(standalone_question)

    result = docs_with_sources_chain(
        {
            "question": standalone_question,
            "input_documents": docs,
        },
    )

    source_documents = [
        {
            "page_content": langchain_document.page_content,
            "file_uuid": langchain_document.metadata.get("parent_doc_uuid"),
            "page_numbers": langchain_document.metadata.get("page_numbers"),
        }
        for langchain_document in result.get("input_documents", [])
    ]

    return {
        "output_text": result["output_text"],
        "source_documents": source_documents
    }




In [22]:
%%capture

df_function = df.copy()

retrieval_context = []
actual_output = []

for question in inputs:
    data = rag_chat(question=question, previous_history=None)

    retrieval_context.append(data['source_documents'])
    actual_output.append(data['output_text'])

df_function['actual_output'] = actual_output
df_function['retrieval_context'] = retrieval_context
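
The loop above stops at the first question that raises an error. If a long run should survive individual failures, one option is to record `None` for the failed question and carry on. This is a sketch: `collect_outputs` and `stub_rag` are hypothetical, and the only assumption about the RAG callable is that it returns a dict with `output_text` and `source_documents`, as `rag_chat()` does.

```python
from typing import Callable, Optional


def collect_outputs(rag_fn: Callable[[str], dict], questions: list[str]) -> tuple[list, list]:
    """Call `rag_fn` per question; record None for both fields when a call fails."""
    actual_output: list[Optional[str]] = []
    retrieval_context: list[Optional[list]] = []
    for question in questions:
        try:
            data = rag_fn(question)
            answer, sources = data["output_text"], data["source_documents"]
        except Exception as error:  # keep going so one bad question doesn't lose the run
            print(f"Failed on {question!r}: {error}")
            answer, sources = None, None
        actual_output.append(answer)
        retrieval_context.append(sources)
    return actual_output, retrieval_context


def stub_rag(question: str) -> dict:
    """Stand-in for rag_chat(): fails on an empty question."""
    if not question:
        raise ValueError("empty question")
    return {"output_text": f"Answer to {question}", "source_documents": [{"page_content": "chunk"}]}


answers, contexts = collect_outputs(stub_rag, ["What is Redbox?", ""])
print(answers)  # ['Answer to What is Redbox?', None]
```

With the real function this would be `collect_outputs(lambda q: rag_chat(question=q, previous_history=None), inputs)`. The `None` entries become missing values in the dataframe, which the `dropna()` step later in the notebook removes.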

##### Using the RAG endpoint

In [13]:
df_endpoint = df.copy()

retrieval_context = []
actual_output = []

headers = {
    'accept': 'application/json',
    'Authorization': 'Bearer ' + bearer_token,
    'Content-Type': 'application/json',
}

url = 'http://127.0.0.1:5002/chat/rag'

for question in inputs:
    data = {
        "message_history": [
            {
                "role": "user",
                "text": question
            }
        ]
    }
    
    response = requests.post(url, headers=headers, data=json.dumps(data))
    data = response.json()

    retrieval_context.append(data['source_documents'])
    actual_output.append(data['output_text'])

df_endpoint['actual_output'] = actual_output
df_endpoint['retrieval_context'] = retrieval_context

#### Confirm `actual_output` & `retrieval_context` added to the dataframe

In [77]:
df_function.head()

Unnamed: 0,input,context,expected_output,actual_output,retrieval_context
0,How did changes to social security payments in...,"['\nOverall, at the end of the study, almost a...",Changes to social security payments in the Net...,**The changes to social security payments in t...,[{'page_content': 'changes to social security ...
1,How does childhood poverty impact the likeliho...,"[', Guettabi & Reimer, 2019).\n\nSimilarly, in...",The probability of obesity in young adults rec...,**Childhood poverty** can impact the likelihoo...,"[{'page_content': 'Similarly, in Western Carol..."
2,How has the share of renewable energy generati...,[' to move away from Russian gas.\n\n· Electri...,Renewable generation rose 12 per cent on the s...,**Renewable generation in the UK rose by 12% c...,[{'page_content': 'Renewable generation rose 1...
3,How has electricity generation by Major Power ...,[' to move away from Russian gas.\n\n· Electri...,Electricity generation by Major Power Producer...,**Electricity generation by Major Power Produc...,[{'page_content': 'Jul-22 3 Electricity gene...
4,How does AI impact skill levels in professiona...,[' assign each occupation to one of four skill...,Professional occupations at skill level 4 are ...,**Professional Occupations**: Professional occ...,[{'page_content': 'Figure 1 shows that profess...


In [14]:
df_endpoint.head()

Unnamed: 0,input,context,expected_output,actual_output,retrieval_context
0,How has the UK's role in supplying gas to Euro...,['BEIS - Monthly energy statistics briefing no...,The UK has been playing a key role in supplyin...,The UK's role in supplying gas to Europe has s...,[{'page_content': '3-monthly growth +55.8% +...
1,How has the share of renewable energy generati...,[' to move away from Russian gas.\n\n· Electri...,Renewable generation in the UK rose by 12% in ...,"In the second quarter of 2022, renewable gener...","[{'page_content': 'Simon, Stuart & Anouka cc: ..."
2,How has the share of renewable energy generati...,[' to move away from Russian gas.\n\n· Electri...,Renewable generation rose 12 per cent on the s...,**Renewable generation in the UK increased by ...,[{'page_content': 'Renewable generation rose 1...
3,How has electricity generation by Major Power ...,[' to move away from Russian gas.\n\n· Electri...,Electricity generation by Major Power Producer...,"Based on the provided documents, electricity g...",[{'page_content': 'Jul-22 3 Electricity gene...
4,What factors influenced the UK's gas productio...,[' Electricity Gas Aug-21 \n\nAug-22\n\n1\n\...,UK gas production increased by 56% in the past...,"In the past year, the UK's gas production incr...",[{'page_content': '3-monthly growth +55.8% +...


#### Remove rows containing NaN to prevent Pydantic validation errors

In [78]:
df_clean = df_endpoint.dropna()
df_clean.to_csv(f'{V_SYNTHETIC}/complete_ragas_synthetic_data.csv', index=False)

[Back to top](#title)

----

## Load Evaluation Dataset into test cases <a class="anchor" id="load-test-cases"></a>

Put the CSV file that you want to use for evaluation into the `notebooks/evaluation/data/{DATA_VERSION}/synthetic/` directory

Import test cases from CSV

In [79]:
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.add_test_cases_from_csv_file(
    file_path=f'{V_SYNTHETIC}/complete_ragas_synthetic_data.csv',
    input_col_name="input",
    actual_output_col_name="actual_output",
    expected_output_col_name="expected_output",
    context_col_name="context",
    context_col_delimiter= ";",
    retrieval_context_col_name="retrieval_context",
    retrieval_context_col_delimiter= ";"
)
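
The loader splits the `context` and `retrieval_context` cells on the given delimiter, whereas the `retrieval_context` column built earlier holds a list of dicts per row. One possible way to line the two up is to flatten each row to a delimiter-joined string of chunk texts before saving the CSV. This is a sketch under that assumption: `join_page_contents` is a hypothetical helper, and it replaces any delimiter found inside a chunk so that chunk boundaries survive the round trip.

```python
def join_page_contents(source_documents: list[dict], delimiter: str = ";") -> str:
    """Join each chunk's `page_content` into one delimiter-separated string."""
    chunks = [
        str(doc.get("page_content", "")).replace(delimiter, ",")  # protect the delimiter
        for doc in source_documents
    ]
    return delimiter.join(chunks)


docs = [
    {"page_content": "First chunk; with a semicolon", "file_uuid": "hypothetical-1"},
    {"page_content": "Second chunk", "file_uuid": "hypothetical-2"},
]
print(join_page_contents(docs))
# First chunk, with a semicolon;Second chunk
```

Applied before the CSV is written, this could look like `df_clean["retrieval_context"].apply(join_page_contents)`. Note that it drops the `file_uuid` and `page_numbers` metadata, which may or may not be wanted in the evaluation.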

[Back to top](#title)

---------

## Evaluate RAG pipeline <a id="evaluate"></a>

DeepEval imports

In [80]:
from deepeval import evaluate
from deepeval.metrics import (
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    ContextualRelevancyMetric,
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    HallucinationMetric,
)

Instantiate retrieval and generation evaluation metrics

In [81]:
# Instantiate retrieval metrics
contextual_precision = ContextualPrecisionMetric(
    threshold=0.5, # default is 0.5
    model="gpt-4o",
    include_reason=True
)

contextual_recall = ContextualRecallMetric(
    threshold=0.5, # default is 0.5
    model="gpt-4o",
    include_reason=True
)

contextual_relevancy = ContextualRelevancyMetric(
    threshold=0.5, # default is 0.5
    model="gpt-4o",
    include_reason=True
)

In [82]:
# Instantiate generation metrics
answer_relevancy = AnswerRelevancyMetric(
    threshold=0.5, # default is 0.5
    model="gpt-4o",
    include_reason=True
)

faithfulness = FaithfulnessMetric(
    threshold=0.5, # default is 0.5
    model="gpt-4o",
    include_reason=True
)

hallucination = HallucinationMetric(
    threshold=0.5, # default is 0.5
    model="gpt-4o",
    include_reason=True
)
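
When reading the results, note that the threshold does not point the same way for every metric. The sketch below assumes the threshold is a minimum for most metrics and a maximum for hallucination, which is consistent with the results printed further down, where a hallucination score of 0.0 passes against a threshold of 0.5. `metric_passes` and `LOWER_IS_BETTER` are hypothetical names, not DeepEval API.

```python
# Metrics where a lower score is the better outcome.
LOWER_IS_BETTER = {"Hallucination"}


def metric_passes(metric_name: str, score: float, threshold: float = 0.5) -> bool:
    """Return True if `score` passes `threshold` for the named metric."""
    if metric_name in LOWER_IS_BETTER:
        return score <= threshold  # threshold acts as a maximum
    return score >= threshold      # threshold acts as a minimum


print(metric_passes("Faithfulness", 1.0))          # True
print(metric_passes("Hallucination", 0.0))         # True
print(metric_passes("Contextual Precision", 0.0))  # False
```

This also matters for the aggregated view at the end of the notebook: a lower mean hallucination score is the better result.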

#### View test cases

In [83]:
dataset.test_cases

[LLMTestCase(input='How did changes to social security payments in the Netherlands Social Assistance Experiments impact employment rates and subjective wellbeing among participants?', actual_output='**The changes to social security payments in the Netherlands Social Assistance Experiments did not have a significant impact on employment rates among participants. However, the study found considerable improvements in mental wellbeing when the conditionality associated with traditional welfare payments was removed or replaced with more supportive interventions. Participants reported higher general satisfaction with life and confidence in their future.** \n\nSources: <Doced708774-780c-4374-af7c-e99df3dea0c4>', expected_output="Changes to social security payments in the Netherlands Social Assistance Experiments had positive treatment effects on subjective wellbeing for all three interventions compared to the control group. Additionally, a statistically significant treatment effect on partici

#### Retrieval Evaluation
Retrieval and generation evaluation are run separately, as retrieval evaluation can take some time

In [84]:
retrieval_eval_results = evaluate(
    test_cases=dataset,
    metrics=[
        contextual_precision,
        contextual_recall,
        contextual_relevancy,
    ]
)

Output()

Evaluating test cases...
Event loop is already running. Applying nest_asyncio patch to allow async execution...



Metrics Summary

  - ❌ Contextual Precision (score: 0, threshold: 0.5, strict: False, evaluation model: gpt-4o, reason: The score is 0.00 because all the nodes in the retrieval context are irrelevant to the input. The first node discusses potential participants and changes in payments but lacks specific information about the impact on employment rates or subjective wellbeing. The second node mentions slight, non-significant increases in employment but does not address the changes in social security payments in the Netherlands Social Assistance Experiments or their impact on subjective wellbeing. The third node mentions mental wellbeing and financial security but is not specific to the Netherlands Social Assistance Experiments. The fourth node discusses improvements in mental wellbeing and satisfaction with life in Finland, which is unrelated to the input. Thus, all irrelevant nodes are ranked higher, resulting in a score of 0.00., error: None)
  - ❌ Contextual Recall (score: 0.333333



#### Save retrieval evaluation results

In [85]:
import pickle 
with open(f'{V_RESULTS}/retrieval_eval_results_v{DATA_VERSION}', 'wb') as f:
    pickle.dump(retrieval_eval_results, f)

In [61]:
with open(f'{V_RESULTS}/retrieval_eval_results_v{DATA_VERSION}', 'rb') as f:
    retrieval_eval_results = pickle.load(f)

#### Generation Evaluation

In [86]:
generation_eval_results = evaluate(
    test_cases=dataset,
    metrics=[
        answer_relevancy,
        faithfulness,
        hallucination
    ]
)

Output()

Evaluating test cases...
Event loop is already running. Applying nest_asyncio patch to allow async execution...



Metrics Summary

  - ✅ Answer Relevancy (score: 1.0, threshold: 0.5, strict: False, evaluation model: gpt-4o, reason: The score is 1.00 because the answer is fully relevant and directly addresses the impact of changes to social security payments on employment rates and subjective wellbeing among participants in the Netherlands Social Assistance Experiments. Great job!, error: None)
  - ✅ Faithfulness (score: 1.0, threshold: 0.5, strict: False, evaluation model: gpt-4o, reason: The score is 1.00 because there are no contradictions at all. Great job maintaining perfect alignment with the retrieval context! Keep up the excellent work!, error: None)
  - ✅ Hallucination (score: 0.0, threshold: 0.5, strict: False, evaluation model: gpt-4o, reason: The score is 1.00 because the actual output contradicts the context provided., error: None)

For test case:

  - input: How did changes to social security payments in the Netherlands Social Assistance Experiments impact employment rates and subje



#### Save generation evaluation results

In [87]:
import pickle 
with open(f'{V_RESULTS}/generation_eval_results_v{DATA_VERSION}', 'wb') as f:
    pickle.dump(generation_eval_results, f)

In [60]:
with open(f'{V_RESULTS}/generation_eval_results_v{DATA_VERSION}', 'rb') as f:
    generation_eval_results = pickle.load(f)

[Back to top](#title)

---------------

## Analyse evaluation results <a id="analysis"></a>

#### Retrieval Evaluation Results

In [88]:
from dataclasses import asdict
import pandas as pd

results = (
    pd.DataFrame.from_records(
        asdict(result) for result in retrieval_eval_results
    )
    .explode("metrics")
    .reset_index(drop=True)
    .assign(
        metric_name = lambda df: df.metrics.apply(getattr, args=["__name__"]),
        score = lambda df: df.metrics.apply(getattr, args=["score"]),
        reason = lambda df: df.metrics.apply(getattr, args=["reason"])
    )
    .drop(columns=['success'])
)

results.to_csv(f'{V_RESULTS}/retrieval_eval_results_v{DATA_VERSION}.csv', index=False)
retrieval_eval_metrics = results
retrieval_eval_metrics.head()

Unnamed: 0,metrics,input,actual_output,expected_output,context,retrieval_context,metric_name,score,reason
0,<deepeval.metrics.contextual_precision.context...,How did changes to social security payments in...,**The changes to social security payments in t...,Changes to social security payments in the Net...,"[['\nOverall, at the end of the study, almost ...",[[{'page_content': 'changes to social security...,Contextual Precision,0.0,The score is 0.00 because all the nodes in the...
1,<deepeval.metrics.contextual_recall.contextual...,How did changes to social security payments in...,**The changes to social security payments in t...,Changes to social security payments in the Net...,"[['\nOverall, at the end of the study, almost ...",[[{'page_content': 'changes to social security...,Contextual Recall,0.333333,The score is 0.33 because while the retrieval ...
2,<deepeval.metrics.contextual_relevancy.context...,How did changes to social security payments in...,**The changes to social security payments in t...,Changes to social security payments in the Net...,"[['\nOverall, at the end of the study, almost ...",[[{'page_content': 'changes to social security...,Contextual Relevancy,1.0,The score is 1.00 because there are no reasons...
3,<deepeval.metrics.contextual_precision.context...,How does childhood poverty impact the likeliho...,**Childhood poverty** can impact the likelihoo...,The probability of obesity in young adults rec...,"[[', Guettabi & Reimer, 2019).\n\nSimilarly, i...","[[{'page_content': 'Similarly, in Western Caro...",Contextual Precision,1.0,"The score is 1.00 because the relevant node, w..."
4,<deepeval.metrics.contextual_recall.contextual...,How does childhood poverty impact the likeliho...,**Childhood poverty** can impact the likelihoo...,The probability of obesity in young adults rec...,"[[', Guettabi & Reimer, 2019).\n\nSimilarly, i...","[[{'page_content': 'Similarly, in Western Caro...",Contextual Recall,0.5,The score is 0.50 because the retrieval contex...


#### Generation Evaluation Results

In [89]:
from dataclasses import asdict
import pandas as pd

results = (
    pd.DataFrame.from_records(
        asdict(result) for result in generation_eval_results
    )
    .explode("metrics")
    .reset_index(drop=True)
    .assign(
        metric_name = lambda df: df.metrics.apply(getattr, args=["__name__"]),
        score = lambda df: df.metrics.apply(getattr, args=["score"]),
        reason = lambda df: df.metrics.apply(getattr, args=["reason"])
    )
    .drop(columns=['success'])
)

results.to_csv(f'{V_RESULTS}/generation_eval_results_v{DATA_VERSION}.csv', index=False)
generation_eval_metrics = results
generation_eval_metrics.head()

Unnamed: 0,metrics,input,actual_output,expected_output,context,retrieval_context,metric_name,score,reason
0,<deepeval.metrics.answer_relevancy.answer_rele...,How did changes to social security payments in...,**The changes to social security payments in t...,Changes to social security payments in the Net...,"[['\nOverall, at the end of the study, almost ...",[[{'page_content': 'changes to social security...,Answer Relevancy,1.0,The score is 1.00 because the answer is fully ...
1,<deepeval.metrics.faithfulness.faithfulness.Fa...,How did changes to social security payments in...,**The changes to social security payments in t...,Changes to social security payments in the Net...,"[['\nOverall, at the end of the study, almost ...",[[{'page_content': 'changes to social security...,Faithfulness,1.0,The score is 1.00 because there are no contrad...
2,<deepeval.metrics.hallucination.hallucination....,How did changes to social security payments in...,**The changes to social security payments in t...,Changes to social security payments in the Net...,"[['\nOverall, at the end of the study, almost ...",[[{'page_content': 'changes to social security...,Hallucination,0.0,The score is 1.00 because the actual output co...
3,<deepeval.metrics.answer_relevancy.answer_rele...,How does childhood poverty impact the likeliho...,**Childhood poverty** can impact the likelihoo...,The probability of obesity in young adults rec...,"[[', Guettabi & Reimer, 2019).\n\nSimilarly, i...","[[{'page_content': 'Similarly, in Western Caro...",Answer Relevancy,0.5,The score is 0.50 because while some relevant ...
4,<deepeval.metrics.faithfulness.faithfulness.Fa...,How does childhood poverty impact the likeliho...,**Childhood poverty** can impact the likelihoo...,The probability of obesity in young adults rec...,"[[', Guettabi & Reimer, 2019).\n\nSimilarly, i...","[[{'page_content': 'Similarly, in Western Caro...",Faithfulness,1.0,The score is 1.00 because there are no contrad...


#### Initial aggregated view

In [90]:
(
    retrieval_eval_metrics
    .groupby("metric_name")[["score"]]  # select the score column explicitly
    .mean()
)

Unnamed: 0_level_0,score
metric_name,Unnamed: 1_level_1
Contextual Precision,0.858333
Contextual Recall,0.675
Contextual Relevancy,0.9


In [91]:
(
    generation_eval_metrics
    .groupby("metric_name")[["score"]]  # select the score column explicitly
    .mean()
)

Unnamed: 0_level_0,score
metric_name,Unnamed: 1_level_1
Answer Relevancy,0.78
Faithfulness,0.896667
Hallucination,0.383333


[Back to top](#title)

-------