# Redbox RAG Evaluation

#### REDBOX-204: [SPIKE] Evaluate DeepEval as the LLM evaluation framework for Redbox

## Table of Contents
* [Overview](#first-section)
* [Generate Evaluation Dataset](#second-section)
* [Get responses from Redbox RAG endpoint](#third-section)
* [Run E2E Evaluation Metrics](#fourth-section)
* [Develop Custom Metrics](#fifth-section)
* [How to add to CI-CD pipeline](#sixth-section)

## Overview <a class="anchor" id="first-section"></a>

This notebook explores how we could use the DeepEval framework for both LLM/RAG unit testing (CI/CD) and RAG evaluation in Redbox. It also aims to get the reader a little more familiar with the DeepEval framework.

## Setup

As this is a spike, leave the project's Poetry setup untouched for now and install deepeval into a fresh virtual environment.

e.g. run the terminal commands below to:

- Deactivate the main project virtual environment
- Create and use a separate virtual environment for running this spike notebook

`source deactivate`

`cd notebooks/evaluation`

`pyenv virtualenv 3.11.8 eval`

`pyenv shell eval`

Restart VS Code so that the `eval` virtualenv is available as a kernel for running this notebook.

#### Install DeepEval

In [None]:
!pip install -U deepeval

## Generate Evaluation Dataset <a class="anchor" id="second-section"></a>

We want to evaluate our RAG application end-to-end. To do this we need to:
1. Generate a dataset from some of the documents we have access to
    - Try using the DeepEval synthesizer for this (currently does not create `expected_output`)
    - *We can also use this [Hugging Face notebook](https://huggingface.co/learn/cookbook/en/rag_evaluation) to generate Q&A data and/or generate the `expected_output` (not done in this spike)*
2. Put the document(s) and all synthetically generated questions through the e2e Redbox `/rag` endpoint

### DeepEval Synthesizer

Use document(s) that we want to RAG over to generate Q&A pairs with relevant context - start simple with one doc.

Document used in this example is: Compass_BassicIncomeForAll_2019.pdf

Follow the steps below to get the chunks for the evaluation document(s):

1. Run app locally WITHOUT detached mode: `docker compose up elasticsearch kibana worker minio redis core-api db django-app`

2. View Swagger UI for /file endpoint at: `http://127.0.0.1:5002/file/docs`

3. Upload documents selected for evaluation

4. Take a note of the uuid(s), e.g. 7b550232-35c4-48fd-8d7a-ba364c1378c4 (this will change each time you run locally)

Chunking happens very quickly. Embedding takes longer, but you get a boolean flag on completion.

5. From the Swagger UI, use the `file/{uuid}/status` endpoint to check status. Use the `uuid`s noted in step 4

6. From the Swagger UI use the `{file_uuid}/chunks` endpoint to get the chunks required for the next step of evaluation. Use the `uuid`s noted in step 4 to get chunks required for evaluation.

The complete output can be downloaded in JSON format from the Swagger UI docs page

7. Move downloaded response into `notebooks/evaluation/data_eval` folder
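Steps 5 and 6 could also be scripted instead of clicked through in the Swagger UI. The sketch below is a minimal illustration using only the standard library. It assumes the status and chunks routes sit under `/file/{uuid}/` on the same host and port as the Swagger docs, and the helper names `file_endpoint` and `fetch_json` are hypothetical. Check the routes against the Swagger UI before relying on them.

```python
import json
from urllib.request import urlopen

# Assumption: same host and port as the Swagger UI used in step 2
BASE_URL = "http://127.0.0.1:5002"


def file_endpoint(file_uuid: str, action: str, base_url: str = BASE_URL) -> str:
    """Build the URL for a per-file action such as 'status' or 'chunks' (assumed route layout)."""
    return f"{base_url.rstrip('/')}/file/{file_uuid}/{action}"


def fetch_json(url: str, timeout: float = 30.0):
    """GET a URL and decode the JSON body. Requires the app to be running locally."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Example UUID from step 4; yours will differ on every local run
example_uuid = "7b550232-35c4-48fd-8d7a-ba364c1378c4"
chunks_url = file_endpoint(example_uuid, "chunks")
print(chunks_url)

# Uncomment once the app is up and the file has been embedded:
# chunks = fetch_json(chunks_url)
```

The network call is left commented out so the cell can be run without the app; only the URL construction is exercised.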

**Use [From Contexts](https://docs.confident-ai.com/docs/evaluation-datasets-synthetic-data#from-contexts) method in DeepEval synthesizer - this will ensure a more robust evaluation, with actual Redbox chunking mechanism used**

The `generate_goldens` method within the Synthesizer class allows for the creation of an evaluation dataset from a manually provided list of `contexts`, which is of type `list[list[str]]`.

This method directly transforms predefined textual contexts into inputs, which are then evolved. The evolved inputs form the basis of the goldens in your evaluation dataset.

In [5]:
# Load chunks created by Redbox
import json

# Define the path to the JSON file
file_path = "data_eval/response_1715091628456.json"

# Open the file and load the JSON data
with open(file_path, 'r') as f:
    data = json.load(f)

In [10]:
len(data)

66

In [18]:
# Define a list of contexts for synthetic data generation, by taking the text of each chunk from the JSON response
# Each context is a single-element list, giving the list[list[str]] shape
contexts = [[chunk['text']] for chunk in data]

In [26]:
from deepeval.synthesizer import Synthesizer
from deepeval.dataset import EvaluationDataset

# Initialize the Synthesizer
synthesizer = Synthesizer()

# contexts generated in cell above

# Generate goldens directly with the synthesizer
synthesizer.generate_goldens(contexts=contexts)
synthesizer.save_as(
    file_type='json',  # The method also supports 'csv'
    directory="./synthetic_data"
)

# Generate goldens within an EvaluationDataset
dataset = EvaluationDataset()
dataset.generate_goldens(
    synthesizer=synthesizer,
    contexts=contexts
)
dataset.save_as(
    file_type='json',  # Similarly, this supports 'csv'
    # directory="./synthetic_data"
    directory="./data_eval/synthetic_data"
)

Synthetic goldens saved at ./synthetic_data/20240507_173546.json!


Evaluation dataset saved at ./data_eval/synthetic_data/20240507_173642.json!


For the generate_goldens method in deepeval, the parameters are:

- contexts: a list of contexts, where each context is itself a list of strings sharing a common theme or subject area.
- [Optional] max_goldens_per_context: the maximum number of golden data points to be generated from each context. Adjusting this parameter can influence the size of the resulting dataset. Defaulted to 2.
- [Optional] num_evolutions: the number of evolution steps to apply to each generated input. This parameter controls the complexity and diversity of the generated dataset by iteratively refining and evolving the initial inputs. Default value is 1.
- [Optional] enable_breadth_evolve: a boolean indicating whether to enable breadth evolution strategies during data generation. When set to True, it introduces a wider variety of context modifications, enhancing the dataset's diversity. Default value is False.

### Review synthetically created dataset

In [2]:
# Load the synthetic dataset generated above
import json

# Define the path to the JSON file
file_path = "data_eval/synthetic_data/20240507_173642.json"

# Open the file and load the JSON data
with open(file_path, 'r') as f:
    dataset = json.load(f)

In [3]:
len(dataset)

396

Goldens do not have an `expected_output`, as it is not required for all metrics. It needs to be generated when you create the evaluation dataset.
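A quick way to see which fields still need filling is to count the empty values per field. The sketch below is illustrative only: `sample_goldens` is made-up data shaped like the saved goldens, and `count_missing` is a hypothetical helper, not part of DeepEval.

```python
from collections import Counter

# Two illustrative rows shaped like the saved goldens (values invented for the example)
sample_goldens = [
    {"input": "Question one?", "actual_output": None, "expected_output": None, "context": ["chunk text"]},
    {"input": "Question two?", "actual_output": None, "expected_output": "An answer", "context": ["chunk text"]},
]


def count_missing(rows, fields=("input", "actual_output", "expected_output", "context")):
    """Count, per field, how many rows have no usable value (None, empty, or absent)."""
    missing = Counter({field: 0 for field in fields})
    for row in rows:
        for field in fields:
            if not row.get(field):
                missing[field] += 1
    return dict(missing)


print(count_missing(sample_goldens))
```

Running the same helper on the loaded `dataset` would show how many rows still lack an `expected_output`.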

In [28]:
dataset[0]

{'input': "Compare the proposed short-term and long-term basic income schemes in Lansley and Reed's report.",
 'actual_output': None,
 'expected_output': None,
 'context': ['It is in this context of growing hardship and insecurity that this report by Stewart Lansley and Howard Reed is so important. It takes forward the debate about basic in- come by modelling both a shorter and longer-term scheme. The shorter-term scheme (model 1) provides for a partial basic income (PBI) designed to be implemented in a single parliament. I hope it will be taken seriously by the Labour Party and others. Rath- er than a limited pilot scheme, which is difficult in the context of a (rightly) centralised — 7 —\n\nBasic Income for All: From Desirability to Feasibility']}

In [30]:
dataset[10]

{'input': 'Compare the impact of social security cuts vs. welfare reforms on the UK social protection system.',
 'actual_output': None,
 'expected_output': None,
 'context': ['Flynn, Joel Flynn, Frances Foley, Barb Jacobson, Luke Martinelli, Anthony Painter, Peter Sloman, Alfie Stirling, Iva Tasseva, Malcolm Torry and Remco van der Stoep. The comments and feedback have been extremely useful; we apologise where we have not been able to embrace the full implications of all the many and detailed suggestions. Stewart Lansley and Howard Reed December 2018\n\n— 5 —\n\nForeword\n\na\n\nBasic Income for All: From Desirability to Feasibility done to the UK social protection system by years of public spending cutbacks targeted at people living in poverty.2 Moreover, social security cuts estimated at over £35bn a year by the early 2020s have been aggravated by ‘welfare reforms’ designed to change behaviour. The latter include an intensified and more extensive sanctions regime, de- scribed recentl

In [41]:
dataset[0]['input']

"Compare the proposed short-term and long-term basic income schemes in Lansley and Reed's report."

### Create An Evaluation Dataset

An `EvaluationDataset` in `deepeval` is simply a collection of `LLMTestCases` and/or `Goldens`.

**INFO**

A `Golden` is very similar to an `LLMTestCase`, but more flexible, as it does not require an `actual_output` at initialization. On the flip side, whilst test cases are always ready for evaluation, a golden isn't.

#### With Test_Cases (come back to this)

#### With Goldens

You should opt to initialize `EvaluationDatasets` with goldens if you're looking to generate LLM outputs at evaluation time. This usually means your original dataset does not contain precomputed outputs, but only the inputs you want to evaluate your LLM (application) on.

**This IS the case for us**

In [50]:
from deepeval.dataset import EvaluationDataset, Golden

first_golden = Golden(input=dataset[0]['input'])

goldens = [first_golden]

test_dataset = EvaluationDataset(goldens=goldens)

In [51]:
test_dataset

EvaluationDataset(test_cases=[], goldens=[Golden(input="Compare the proposed short-term and long-term basic income schemes in Lansley and Reed's report.", actual_output=None, expected_output=None, context=None, retrieval_context=None, additional_metadata=None, comments=None, source_file=None)], conversational_goldens=[], _alias=None, _id=None)

As of May 2024, `DeepEval`'s synthesizer only generates `input`, i.e. questions. It does not create `expected_output` - this will be coming in the next release.

#### TODO:

Additionally, the synthesizer (I think) only generates each `input` from a single chunk. A `#TODO` would be to generate `input` questions from combinations of multiple contexts, so that answering them requires more than one chunk to be retrieved.

We also need to store the context used to generate each `input` alongside the real context returned by the Redbox Core API. Some metrics may require the context used by the synthesizer, so it is best to keep both.
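One possible starting point for the multi-chunk TODO is to group consecutive chunks into overlapping windows and pass those as `contexts`. This is a sketch under assumptions: `window_contexts` is a hypothetical helper, the chunk texts are placeholders, and whether the synthesizer then produces questions that genuinely span several chunks has not been checked in this spike.

```python
def window_contexts(chunk_texts, window_size=2, step=1):
    """Group consecutive chunk texts into overlapping contexts of `window_size` chunks."""
    if window_size < 1 or step < 1:
        raise ValueError("window_size and step must be positive")
    if len(chunk_texts) < window_size:
        # Too few chunks for a full window: return them as one context (or nothing)
        return [list(chunk_texts)] if chunk_texts else []
    return [
        list(chunk_texts[start:start + window_size])
        for start in range(0, len(chunk_texts) - window_size + 1, step)
    ]


# Placeholder chunk texts standing in for the Redbox chunks
chunk_texts = ["chunk A", "chunk B", "chunk C", "chunk D"]
multi_contexts = window_contexts(chunk_texts, window_size=2)
print(multi_contexts)
```

The result keeps the `list[list[str]]` shape, so it can be used wherever the single-chunk `contexts` list is used.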

---------

**TODO:** Generate expected output using LLMs ([HuggingFace have a good notebook on this](https://huggingface.co/learn/cookbook/en/rag_evaluation))

In [None]:
#TODO: Generate expected output for the test dataset

-----------

## Get responses from Redbox RAG endpoint <a class="anchor" id="third-section"></a>

### Upload document(s) that we want to RAG over - start simple with one doc

The evaluation docs should already have been uploaded to the locally running application, in the steps above. If not, do so now.

### Format each question into the required schema for the /rag endpoint

In [4]:
payloads = []

for i in range(len(dataset)):
    # Avoid naming the variable `dict`, which would shadow the built-in type
    payload = {
        "message_history": [
            {
                "role": "system",
                "text": "You are a helpful AI Assistant"
            },
            {
                "role": "user",
                "text": dataset[i]['input']
            }
        ]
    }
    payloads.append(payload)

### Calling /chat/rag endpoint

Use Python's `concurrent.futures` module to send the requests in parallel.

The script below sends a POST request to the FastAPI endpoint (`/chat/rag`) for each JSON payload. A `ThreadPoolExecutor` sends these requests in parallel, with a maximum of 10 workers.

The status code and body of each response are stored in a list and processed after all requests have been made. This way, you can perform operations like counting the number of successful requests, logging failed requests, etc.

In [5]:
import requests
import json
from concurrent.futures import ThreadPoolExecutor

def post_request(payload):
    """POST one payload to the /chat/rag endpoint and return (status code, response body)."""
    url = "http://127.0.0.1:5002/chat/rag"
    # json= serialises the payload and sets the Content-Type header to application/json
    response = requests.post(url, json=payload)
    return response.status_code, response.json()

with ThreadPoolExecutor(max_workers=10) as executor:
    futures = [executor.submit(post_request, payload) for payload in payloads]

# Collect results in submission order, so status_and_body[i] corresponds to payloads[i]
status_and_body = [future.result() for future in futures]
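Once all responses are collected, counting successes and listing failures is a small post-processing step. The sketch below assumes the `(status_code, body)` tuple layout used above; `summarise_responses` is a hypothetical helper and `sample` is stand-in data, not real Redbox output.

```python
def summarise_responses(status_and_body):
    """Split (status_code, body) pairs into successes and failures, keeping the original index."""
    successes, failures = [], []
    for index, (status, body) in enumerate(status_and_body):
        target = successes if 200 <= status < 300 else failures
        target.append((index, status, body))
    return successes, failures


# Illustrative stand-in data with the same tuple layout
sample = [
    (200, {"output_text": "answer"}),
    (500, {"detail": "error"}),
    (200, {"output_text": "another answer"}),
]
ok, failed = summarise_responses(sample)
print(f"{len(ok)} succeeded, {len(failed)} failed")
for index, status, _ in failed:
    print(f"payload {index} failed with status {status}")
```

Keeping the index makes it easy to map a failed response back to the question that produced it and retry only those payloads.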

In [6]:
status_and_body[0]

(200,
 {'source_documents': [{'page_content': 'It is in this context of growing hardship and insecurity that this report by Stewart Lansley and Howard Reed is so important. It takes forward the debate about basic in- come by modelling both a shorter and longer-term scheme. The shorter-term scheme (model 1) provides for a partial basic income (PBI) designed to be implemented in a single parliament. I hope it will be taken seriously by the Labour Party and others. Rath- er than a limited pilot scheme, which is difficult in the context of a (rightly) centralised — 7 —\n\nBasic Income for All: From Desirability to Feasibility',
    'file_uuid': '7b550232-35c4-48fd-8d7a-ba364c1378c4',
    'page_numbers': None},
   {'page_content': 'initial basic income funded through the existing tax/benefit system – and a longer term step, with a citizens’ fund building over time to finance a more generous scheme. The combined ap- proach could be implemented well within a single generation. This approach p

Save responses

In [8]:
import pickle

# Save the status_and_body list to a file
with open('data_eval/rag_responses.pkl', 'wb') as f:
    pickle.dump(status_and_body, f)

Load from file

In [None]:
import pickle

# Load the status_and_body list from a file
with open('data_eval/rag_responses.pkl', 'rb') as f:
    status_and_body = pickle.load(f)

In [13]:
status_and_body[0][1]['source_documents']

[{'page_content': 'It is in this context of growing hardship and insecurity that this report by Stewart Lansley and Howard Reed is so important. It takes forward the debate about basic in- come by modelling both a shorter and longer-term scheme. The shorter-term scheme (model 1) provides for a partial basic income (PBI) designed to be implemented in a single parliament. I hope it will be taken seriously by the Labour Party and others. Rath- er than a limited pilot scheme, which is difficult in the context of a (rightly) centralised — 7 —\n\nBasic Income for All: From Desirability to Feasibility',
  'file_uuid': '7b550232-35c4-48fd-8d7a-ba364c1378c4',
  'page_numbers': None},
 {'page_content': 'initial basic income funded through the existing tax/benefit system – and a longer term step, with a citizens’ fund building over time to finance a more generous scheme. The combined ap- proach could be implemented well within a single generation. This approach provides a baseline income, and thu

In [14]:
dataset[0]

{'input': "Compare the proposed short-term and long-term basic income schemes in Lansley and Reed's report.",
 'actual_output': None,
 'expected_output': None,
 'context': ['It is in this context of growing hardship and insecurity that this report by Stewart Lansley and Howard Reed is so important. It takes forward the debate about basic in- come by modelling both a shorter and longer-term scheme. The shorter-term scheme (model 1) provides for a partial basic income (PBI) designed to be implemented in a single parliament. I hope it will be taken seriously by the Labour Party and others. Rath- er than a limited pilot scheme, which is difficult in the context of a (rightly) centralised — 7 —\n\nBasic Income for All: From Desirability to Feasibility']}

In [27]:
status_and_body[0][1]['source_documents']

[{'page_content': 'It is in this context of growing hardship and insecurity that this report by Stewart Lansley and Howard Reed is so important. It takes forward the debate about basic in- come by modelling both a shorter and longer-term scheme. The shorter-term scheme (model 1) provides for a partial basic income (PBI) designed to be implemented in a single parliament. I hope it will be taken seriously by the Labour Party and others. Rath- er than a limited pilot scheme, which is difficult in the context of a (rightly) centralised — 7 —\n\nBasic Income for All: From Desirability to Feasibility',
  'file_uuid': '7b550232-35c4-48fd-8d7a-ba364c1378c4',
  'page_numbers': None},
 {'page_content': 'initial basic income funded through the existing tax/benefit system – and a longer term step, with a citizens’ fund building over time to finance a more generous scheme. The combined ap- proach could be implemented well within a single generation. This approach provides a baseline income, and thu

In [28]:
page_contents = [d['page_content'] for d in status_and_body[0][1]['source_documents']]

In [29]:
page_contents

['It is in this context of growing hardship and insecurity that this report by Stewart Lansley and Howard Reed is so important. It takes forward the debate about basic in- come by modelling both a shorter and longer-term scheme. The shorter-term scheme (model 1) provides for a partial basic income (PBI) designed to be implemented in a single parliament. I hope it will be taken seriously by the Labour Party and others. Rath- er than a limited pilot scheme, which is difficult in the context of a (rightly) centralised — 7 —\n\nBasic Income for All: From Desirability to Feasibility',
 'initial basic income funded through the existing tax/benefit system – and a longer term step, with a citizens’ fund building over time to finance a more generous scheme. The combined ap- proach could be implemented well within a single generation. This approach provides a baseline income, and thus a bedrock of security in an increasingly insecure world that boosts personal freedom and extends choices about w

In [23]:
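for i in range(len(dataset)):
    # This fails: 'source_documents' is a list of dicts, so it cannot be indexed by 'page_content' directly
    dataset[i]['retrieved_context'] = status_and_body[i][1]['source_documents']['page_content']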

TypeError: list indices must be integers or slices, not str

In [16]:
for i in range(len(dataset)):
    dataset[i]['retrieved_context'] = status_and_body[i][1]['source_documents']

{'input': "Compare the proposed short-term and long-term basic income schemes in Lansley and Reed's report.",
 'actual_output': None,
 'expected_output': None,
 'context': ['It is in this context of growing hardship and insecurity that this report by Stewart Lansley and Howard Reed is so important. It takes forward the debate about basic in- come by modelling both a shorter and longer-term scheme. The shorter-term scheme (model 1) provides for a partial basic income (PBI) designed to be implemented in a single parliament. I hope it will be taken seriously by the Labour Party and others. Rath- er than a limited pilot scheme, which is difficult in the context of a (rightly) centralised — 7 —\n\nBasic Income for All: From Desirability to Feasibility'],
 'retrieved_context': [{'page_content': 'It is in this context of growing hardship and insecurity that this report by Stewart Lansley and Howard Reed is so important. It takes forward the debate about basic in- come by modelling both a shor

In [17]:
for i in range(len(dataset)):
    dataset[i]['actual_output'] = status_and_body[i][1]['output_text']

In [18]:
dataset[0]

{'input': "Compare the proposed short-term and long-term basic income schemes in Lansley and Reed's report.",
 'actual_output': "The proposed short-term basic income scheme in the Lansley and Reed report involves paying £60 per week to adults aged 18-64, £40 per week to mothers for each child aged 0-17, and £175 per week to adults aged 65 and above with residency eligibility. It would abolish child benefit and the state pension but retain other parts of the existing social security system, including means-tested benefits. The net cost is estimated to be £28bn. The long-term scheme involves building a citizens' fund over time to finance a more generous scheme that provides a baseline income and boosts personal freedom. Both schemes aim to reduce poverty and promote social security reform.\n\n**Short-term Basic Income Scheme (Model 1):**\n- Weekly payments: £60 for adults 18-64, £40 for children 0-17, £175 for adults 65+\n- Abolishes child benefit and state pension\n- Costs estimated at 

## Evaluating Retrieval <a class="anchor" id="retrieval-section"></a>

Which context should we use: the one used for generating the synthetic data, or the context returned by Redbox `/chat/rag`? You should use the context returned by your RAG application!
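The `/chat/rag` response returns `source_documents` as a list of dicts, whereas a test case's `retrieval_context` needs a list of strings. A minimal sketch of that conversion is shown here; `extract_retrieval_context` is a hypothetical helper, and `sample_body` is invented data that only mirrors the keys seen in the responses above.

```python
def extract_retrieval_context(body):
    """Return the page_content string of each source document in one /chat/rag response body."""
    return [doc["page_content"] for doc in body.get("source_documents", [])]


# Illustrative response body using the same keys as the real responses (values invented)
sample_body = {
    "output_text": "An answer",
    "source_documents": [
        {"page_content": "first chunk", "file_uuid": "example-uuid", "page_numbers": None},
        {"page_content": "second chunk", "file_uuid": "example-uuid", "page_numbers": None},
    ],
}

retrieval_context = extract_retrieval_context(sample_body)
print(retrieval_context)
```

Applied to every row, for example `extract_retrieval_context(status_and_body[i][1])`, this would give each question its own retrieved context rather than reusing the one from the first response.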

In [20]:
from deepeval.metrics import (
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    ContextualRelevancyMetric
)

contextual_precision = ContextualPrecisionMetric()
contextual_recall = ContextualRecallMetric()
contextual_relevancy = ContextualRelevancyMetric()



Then, define a test case. Note that deepeval gives you the flexibility to either begin evaluating with complete datasets, or perform the retrieval and generation at evaluation time.



In [30]:
from deepeval.test_case import LLMTestCase
...

test_case = LLMTestCase(
    input=dataset[0]['input'],
    actual_output=dataset[0]['actual_output'],
    #TODO: need expected output
    retrieval_context=page_contents # Needs to be None or a list of strings
)

In [34]:
from deepeval import evaluate
...

evaluate(
    test_cases=[test_case],
    metrics=[contextual_relevancy]
    # metrics=[contextual_precision, contextual_recall, contextual_relevancy]
)

Evaluating test cases...
Event loop is already running. Applying nest_asyncio patch to allow async execution...




Metrics Summary

  - ✅ Contextual Relevancy (score: 1.0, threshold: 0.5, strict: False, evaluation model: gpt-4-turbo, reason: The score is 1.00 because the retrieval context perfectly aligns with the input, focusing exactly on comparing short-term and long-term basic income schemes as requested., error: None)

For test case:

  - input: Compare the proposed short-term and long-term basic income schemes in Lansley and Reed's report.
  - actual output: The proposed short-term basic income scheme in the Lansley and Reed report involves paying £60 per week to adults aged 18-64, £40 per week to mothers for each child aged 0-17, and £175 per week to adults aged 65 and above with residency eligibility. It would abolish child benefit and the state pension but retain other parts of the existing social security system, including means-tested benefits. The net cost is estimated to be £28bn. The long-term scheme involves building a citizens' fund over time to finance a more generous scheme that



[TestResult(success=True, metrics=[<deepeval.metrics.contextual_relevancy.contextual_relevancy.ContextualRelevancyMetric object at 0x12c8bc2d0>], input="Compare the proposed short-term and long-term basic income schemes in Lansley and Reed's report.", actual_output="The proposed short-term basic income scheme in the Lansley and Reed report involves paying £60 per week to adults aged 18-64, £40 per week to mothers for each child aged 0-17, and £175 per week to adults aged 65 and above with residency eligibility. It would abolish child benefit and the state pension but retain other parts of the existing social security system, including means-tested benefits. The net cost is estimated to be £28bn. The long-term scheme involves building a citizens' fund over time to finance a more generous scheme that provides a baseline income and boosts personal freedom. Both schemes aim to reduce poverty and promote social security reform.\n\n**Short-term Basic Income Scheme (Model 1):**\n- Weekly paym

## Evaluating Generation <a class="anchor" id="generation-section"></a>

The generation metrics are included in the E2E run below: `RAGASFaithfulnessMetric` and `RAGASAnswerRelevancyMetric`.

## Run E2E Evaluation Metrics <a class="anchor" id="fourth-section"></a>

In [None]:
# test_rag.py
import pytest
from deepeval import assert_test
from deepeval.metrics.ragas import (
    RAGASContextualPrecisionMetric,
    RAGASFaithfulnessMetric,
    RAGASContextualRecallMetric,
    RAGASAnswerRelevancyMetric,
)
from deepeval.metrics import BiasMetric
from deepeval.test_case import LLMTestCase

#######################################
# Initialize metrics with thresholds ##
#######################################
bias = BiasMetric(threshold=0.5)
contextual_precision = RAGASContextualPrecisionMetric(threshold=0.5)
contextual_recall = RAGASContextualRecallMetric(threshold=0.5)
answer_relevancy = RAGASAnswerRelevancyMetric(threshold=0.5)
faithfulness = RAGASFaithfulnessMetric(threshold=0.5)

#######################################
# Specify evaluation metrics to use ###
#######################################
evaluation_metrics = [
  bias,
  contextual_precision,
  contextual_recall,
  answer_relevancy,
  faithfulness
]

#######################################
# Specify inputs to test RAG app on ###
#######################################
input_output_pairs = [
  {
    "input": "",
    "expected_output": "", 
  },
  {
    "input": "",
    "expected_output": "", 
  }
]

#######################################
# Loop through input output pairs #####
#######################################
@pytest.mark.parametrize(
    "input_output_pair",
    input_output_pairs,
)
def test_rag(input_output_pair: dict):
    input = input_output_pair.get("input", None)
    expected_output = input_output_pair.get("expected_output", None)

    # Hypothetical RAG application for demonstration only
    # (`rag_application` is not defined in this notebook).
    # Replace this with your own RAG implementation.
    # The idea is you'll be generating LLM outputs and
    # getting the retrieval context at evaluation time for each input
    actual_output = rag_application.query(input)
    retrieval_context = rag_application.get_retrieval_context()

    test_case = LLMTestCase(
        input=input,
        actual_output=actual_output,
        retrieval_context=retrieval_context,
        expected_output=expected_output
    )
    # assert test case
    assert_test(test_case, evaluation_metrics)

Execute the test file via the CLI:



`deepeval test run test_rag.py`
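The empty `input_output_pairs` placeholders in the test file could be filled from the saved goldens file rather than typed by hand. The sketch below assumes the file is a JSON list of dicts with `input` and `expected_output` keys, as seen when the synthetic dataset was loaded earlier; `load_input_output_pairs` is a hypothetical helper and the demo rows are invented.

```python
import json
import tempfile
from pathlib import Path


def load_input_output_pairs(path):
    """Read a saved goldens JSON file and keep the two fields the parametrised test uses."""
    rows = json.loads(Path(path).read_text(encoding="utf-8"))
    return [
        {"input": row["input"], "expected_output": row.get("expected_output")}
        for row in rows
        if row.get("input")  # skip rows without a question
    ]


# Demo with a throwaway file; in the spike this would be the file under data_eval/synthetic_data
demo_rows = [
    {"input": "Question one?", "actual_output": None, "expected_output": None, "context": ["chunk"]},
    {"input": "", "actual_output": None, "expected_output": None, "context": ["chunk"]},
]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "goldens.json"
    demo_path.write_text(json.dumps(demo_rows), encoding="utf-8")
    input_output_pairs = load_input_output_pairs(demo_path)

print(input_output_pairs)
```

Metrics that need an `expected_output` would still require that field to be generated first, since the synthesizer leaves it empty.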

## Develop Custom Metrics <a class="anchor" id="fifth-section"></a>

(For future): Develop custom metrics for the Cabinet Office domain

## How to add to CI-CD pipeline <a class="anchor" id="sixth-section"></a>

This is just skeleton code for now, to get feedback from the team.

All that is needed is to add the `deepeval test run` command to your CI/CD environment. Using GitHub Actions as an example, here is how you could add DeepEval to a workflow YAML file:

In [35]:
"""
name: RAG Deployment Evaluations

on:
  push:

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      # Some extra steps to setup and install dependencies
      # ...

      # Optional Login
      - name: Login to Confident
        env:
          CONFIDENT_API_KEY: ${{ secrets.CONFIDENT_API_KEY }}
        run: poetry run deepeval login --confident-api-key "$CONFIDENT_API_KEY"

      - name: Run deepeval tests
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: poetry run deepeval test run test_rag.py
"""
