# RAG

This work looks at the implementation of retrieval-augmented generation (RAG) within NHS England. This notebook contains a simple RAG pipeline that can run with RAG turned on or with RAG turned off (relying only on the model's innate "knowledge").

## Setup

In [2]:
import glob
import os
import pandas as pd
import random


import toml
from dotenv import load_dotenv

import src.models as models

from tqdm import tqdm

config = toml.load("config.toml")
load_dotenv(".secrets")
os.environ["ANTHROPIC_API_KEY"] = os.getenv("anthropic_key")

if config['DEV_MODE']:
    config['PERSIST_DIRECTORY'] += "/dev"




First we initialise the RAG pipeline. This is an object that links the vector store and the LLM: when you pass in a query, it is sent to the vector store to find relevant document chunks, and the LLM then returns a response.

There are also methods for adding documents to the database.
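To make the idea concrete, here is a minimal sketch of what such an object does. `ToyRagPipeline` and all of its methods are hypothetical and are not the `src.models.RagPipeline` used in this notebook: it swaps the vector store for a word-overlap search over an in-memory dictionary and only builds the prompt rather than calling an LLM.

```python
# Toy illustration only: ToyRagPipeline is hypothetical, NOT src.models.RagPipeline.
from collections import Counter


class ToyRagPipeline:
    """Links a tiny in-memory document store to a prompt builder."""

    def __init__(self):
        self.documents = {}  # maps a source name to its text

    def add_document(self, source, text):
        self.documents[source] = text

    def retrieve(self, query, k=1):
        # Rank documents by how many query words they share. This is a crude
        # stand-in for the embedding similarity a real vector store would use.
        query_words = Counter(query.lower().split())

        def overlap(item):
            doc_words = Counter(item[1].lower().split())
            return sum((query_words & doc_words).values())

        ranked = sorted(self.documents.items(), key=overlap, reverse=True)
        return ranked[:k]

    def build_prompt(self, query, rag=True):
        # With rag=False the question is passed on unchanged.
        if not rag:
            return query
        context = "\n".join(f"[{src}] {text}" for src, text in self.retrieve(query))
        return f"Context:\n{context}\n\nQuestion: {query}"


pipeline = ToyRagPipeline()
pipeline.add_document("colonoscopy.txt", "A colonoscopy is a test to check inside your bowels")
pipeline.add_document("flu.txt", "Flu will often get better on its own")
print(pipeline.build_prompt("What is a colonoscopy?"))
```

The only difference between the two modes is whether retrieved text is placed in front of the question before it reaches the model.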

In [3]:
rag_pipeline = models.RagPipeline(config['EMBEDDING_MODEL'], config['PERSIST_DIRECTORY'])



We need to fill the database if it's empty (this might take 5 minutes or so the first time, unless you have a fast graphics card).

In [4]:
# Add documents if there are none; in DEV mode, don't add any more if the store is not empty
if len(rag_pipeline.vectorstore.get()['documents']) == 0 or (not config['DEV_MODE']):
    rag_pipeline.load_documents()  

## Generating Response from Cogstack Questions

In [12]:
#load cogstack question and answers
cogstack_qa = pd.read_csv('src/model_eval/cogstack_qa_data_process.csv')

sample_qa = cogstack_qa.sample(n = 10, random_state = 1234)

sample_qa


Unnamed: 0.1,Unnamed: 0,question,answer,reference,short_reference
14767,14767,What is a completion lymph node dissection?,An operation to remove the remaining lymph nod...,https://www.nhs.uk/conditions/melanoma-skin-ca...,melanoma-skin-cancer
6653,6653,What is NHS COVID Pass?,The NHS COVID Pass is a certification system t...,https://www.nhs.uk/conditions/coronavirus-covi...,coronavirus-covid-19
11645,11645,What happens after a heart transplant assessment?,"After the assessment, a final decision is made...",https://www.nhs.uk/conditions/heart-transplant...,heart-transplant
2817,2817,Can the blood spot test result be false positive?,"Yes, a small number of babies will screen posi...",https://www.nhs.uk/conditions/baby/newborn-scr...,baby
1885,1885,What kind of brain scans are used to check for...,The two most widely used brain imaging scans t...,https://www.nhs.uk/conditions/ataxia/diagnosis/,ataxia
9495,9495,Can prosopagnosia affect mental health?,"Yes, difficulty recognizing faces may make it ...",https://www.nhs.uk/conditions/face-blindness/,face-blindness
10385,10385,How effective is antifungal nail cream?,There is no guarantee that antifungal nail cre...,https://www.nhs.uk/conditions/fungal-nail-infe...,fungal-nail-infection
4226,4226,Do I need to get medical help for broken or br...,Get advice from 111 now if your pain has not i...,https://www.nhs.uk/conditions/broken-or-bruise...,broken-or-bruised-ribs
13792,13792,"What is a nasendoscopy, and how is it performed?",A nasendoscopy is a medical procedure used to ...,https://www.nhs.uk/conditions/laryngeal-cancer...,laryngeal-cancer
3708,3708,What is a colonoscopy?,A colonoscopy is a test that involves passing ...,https://www.nhs.uk/conditions/bowel-cancer-scr...,bowel-cancer-screening


Now we run each question with **RAG** turned off and then turned on. Because the pipeline is set to be verbose, each RAG call prints the completed prompt it submitted to the LLM, including the document chunks it found, followed by the answer.

In [13]:

#run the question prompt through the llm with and without rag recording the responses
rag_on = []
rag_off = []
llm_references = []

for index, row in sample_qa.iterrows():
    #retrieve question answer and references from df
    cogstack_q = row['question']
    cogstack_a = row['answer']

    #run question prompt through LLM with rag = False
    result_rag_off = rag_pipeline.answer_question(cogstack_q, rag=False)
    rag_off.append(result_rag_off)

    #run question prompt through LLM with rag = True, retrying up to 5 times
    #until the output has some answer text followed by a 'SOURCES:' marker
    words_on = []
    source_idx = None
    for attempt in range(5):
        result_rag_on = rag_pipeline.answer_question(cogstack_q, rag=True)
        words_on = result_rag_on.split()
        #accept only if 'SOURCES:' is present and is not the first word
        #(an empty output or an output containing only sources is retried)
        if 'SOURCES:' in words_on and words_on.index('SOURCES:') > 0:
            source_idx = words_on.index('SOURCES:')
            break

    if source_idx is None:
        #no usable output after all retries: keep the last raw text, record no reference
        llm_rag_on_response = ' '.join(words_on)
        llm_ref = ''
    else:
        #separate by word and extract the generated response and the reference
        llm_rag_on_response = ' '.join(words_on[:source_idx])
        llm_ref = ' '.join(words_on[source_idx + 1:])
    #append generated response and corresponding reference
    rag_on.append(llm_rag_on_response)
    llm_references.append(llm_ref)

    




> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
You are a helpful assistant that helps people with their questions. You are not a replacement for human judgement, but you can help humansmake more informed decisions. If you are asked a question you cannot answer based on your following instructions, you should say so.Be concise and professional in your responses.

 Given the following extracted parts of a long document and a question, create a final answer with references ("SOURCES"). If you don't know the answer, just say that you don't know. Don't try to make up an answer. ALWAYS return a "SOURCES" part in your answer.

Example 1: "**RAP** is to be the foundation of analyst training. SOURCES: (goldacre_review.txt)"
Example 2: "Open source code is a good idea because:
* it's cheap (goldacre_review.txt)
* it's easy for people to access and use (open_source_guidlines.txt)
* it's easy to share (goldacre
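The cell above separates each RAG answer from its references at the `SOURCES:` marker that the prompt asks the model to return. A minimal sketch of that split as a standalone helper is shown here; `split_answer_and_sources` is a hypothetical name, and unlike the notebook cell it uses `str.partition` rather than splitting on whitespace, so internal spacing is preserved.

```python
def split_answer_and_sources(text, marker="SOURCES:"):
    """Return (answer, sources); sources is None when the marker is missing."""
    answer, found, sources = text.partition(marker)
    if not found:
        # No marker at all: treat the whole output as the answer.
        return text.strip(), None
    return answer.strip(), sources.strip()


answer, sources = split_answer_and_sources(
    "**RAP** is to be the foundation of analyst training. SOURCES: (goldacre_review.txt)"
)
print(answer)
print(sources)
```

An empty `answer` with non-empty `sources` corresponds to the "output only contains sources" case that the retry loop is meant to catch.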

In [16]:
sample_qa['rag_off'] = rag_off
sample_qa['rag_on'] = rag_on
sample_qa['llm_reference'] = llm_references
sample_qa

Unnamed: 0.1,Unnamed: 0,question,answer,reference,short_reference,rag_off,rag_on,llm_reference
14767,14767,What is a completion lymph node dissection?,An operation to remove the remaining lymph nod...,https://www.nhs.uk/conditions/melanoma-skin-ca...,melanoma-skin-cancer,\nA completion lymph node dissection (CLND) is...,"Unfortunately, I could not find the definition...",(none)
6653,6653,What is NHS COVID Pass?,The NHS COVID Pass is a certification system t...,https://www.nhs.uk/conditions/coronavirus-covi...,coronavirus-covid-19,\nThe NHS COVID Pass is a digital pass that co...,Based on the information provided in the docum...,"(covid-19.txt, vaccinations.txt, flu.txt, hepa..."
11645,11645,What happens after a heart transplant assessment?,"After the assessment, a final decision is made...",https://www.nhs.uk/conditions/heart-transplant...,heart-transplant,"\nAfter a heart transplant assessment, a few t...","After a heart transplant assessment, if the do...",(heart-transplant.txt)
2817,2817,Can the blood spot test result be false positive?,"Yes, a small number of babies will screen posi...",https://www.nhs.uk/conditions/baby/newborn-scr...,baby,"\nYes, false positive blood spot tests for var...","Yes, it is possible to get a false positive re...",(nhs-screening.txt)
1885,1885,What kind of brain scans are used to check for...,The two most widely used brain imaging scans t...,https://www.nhs.uk/conditions/ataxia/diagnosis/,ataxia,\n- MRI scan (Magnetic Resonance Imaging): Thi...,MRI scans and CT scans can be used to check fo...,"(pet-scan.txt, mri-scan.txt, hydrocephalus.txt..."
9495,9495,Can prosopagnosia affect mental health?,"Yes, difficulty recognizing faces may make it ...",https://www.nhs.uk/conditions/face-blindness/,face-blindness,"\nProsopagnosia, or ""face blindness,"" can inde...","Based on the provided documents, prosopagnosia...",(docs\face-blindness.txt)
10385,10385,How effective is antifungal nail cream?,There is no guarantee that antifungal nail cre...,https://www.nhs.uk/conditions/fungal-nail-infe...,fungal-nail-infection,\n- Antifungal nail creams contain ingredients...,"Based on the provided documents, antifungal na...","(docs\fungal-nail-infection.txt, docs\antifung..."
4226,4226,Do I need to get medical help for broken or br...,Get advice from 111 now if your pain has not i...,https://www.nhs.uk/conditions/broken-or-bruise...,broken-or-bruised-ribs,\n- Broken ribs can be serious and require med...,"Based on the information provided, you should ...",(broken-or-bruised-ribs.txt)
13792,13792,"What is a nasendoscopy, and how is it performed?",A nasendoscopy is a medical procedure used to ...,https://www.nhs.uk/conditions/laryngeal-cancer...,laryngeal-cancer,\nA nasendoscopy is an endoscopic examination ...,Nasendoscopy refers to an endoscopy procedure ...,"(endoscopy.txt, bioposy.txt)"
3708,3708,What is a colonoscopy?,A colonoscopy is a test that involves passing ...,https://www.nhs.uk/conditions/bowel-cancer-scr...,bowel-cancer-screening,\nA colonoscopy is a medical test where a doct...,A colonoscopy is a test to check inside your b...,"(colonoscopy.txt, bowel-polyps.txt)"


<h1>Evaluation and Comparisons</h1>

1. Does the LLM response match the Cogstack answer (with RAG on and with RAG off)? This is scored using the LangChain scoring evaluator with a reference.
2. Does the LLM reference match, or include, the Cogstack reference?
3. Use an OpenAI model to judge the responses from our Claude model.
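For the second comparison, one possible check is sketched below. It assumes that `llm_reference` holds file names such as `(docs\face-blindness.txt)` and that `short_reference` holds the matching page name, as in the sample table above; `cited_pages` and `cites_expected_page` are hypothetical helper names.

```python
import re


def cited_pages(llm_reference):
    """Extract page names from a reference string listing .txt files."""
    # Capture the file stem before '.txt'; a folder prefix such as docs\ is
    # skipped because the backslash is not part of the character class.
    return set(re.findall(r"([\w-]+)\.txt", llm_reference))


def cites_expected_page(llm_reference, short_reference):
    """True when the expected page is among the pages the LLM cited."""
    return short_reference in cited_pages(llm_reference)


print(cites_expected_page(r"(docs\face-blindness.txt)", "face-blindness"))
print(cites_expected_page("(nhs-screening.txt)", "baby"))
```

This is an exact match on the page name, so it is strict: `headache` does not match `headaches.txt`, and a relevant page filed under a different name counts as a miss.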

<h3>1.1 - Using Langchain Scoring Evaluator with Default Criteria</h3>
Scoring the responses from the LLM with the ground truth when RAG is turned on and off

In [17]:
#load in default langchain scoring evaluator
from langchain.evaluation import load_evaluator

evaluator = load_evaluator("labeled_score_string", llm=rag_pipeline.llm)

In [18]:
# evaluate rag_on and rag_off responses with the ground truth and record the score and reasoning
rag_on_scores = []
rag_off_scores = []
rag_on_exp = []
rag_off_exp = []

for idx in range(sample_qa.shape[0]):
    #get an evaluation score using default criterion for rag on and rag off responses
    eval_result_rag_on = evaluator.evaluate_strings(
        prediction= sample_qa.iloc[idx]['rag_on'],
        reference= sample_qa.iloc[idx]['answer'],
        input= sample_qa.iloc[idx]['question'])
    
    eval_result_rag_off = evaluator.evaluate_strings(
        prediction=sample_qa.iloc[idx]['rag_off'],
        reference=sample_qa.iloc[idx]['answer'],
        input=sample_qa.iloc[idx]['question'])
    
    #append the scores and reasoning to the respective lists to be added as columns
    rag_on_scores.append(eval_result_rag_on['score'])
    rag_on_exp.append(eval_result_rag_on['reasoning'])

    rag_off_scores.append(eval_result_rag_off['score'])
    rag_off_exp.append(eval_result_rag_off['reasoning'])

    

ValueError: Invalid output: <|endoftext|>. Output must contain a double bracketed string                 with the verdict between 1 and 10.
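The `ValueError` above stops the loop part-way through, which leaves the score lists shorter than the dataframe. One way to keep the lists aligned is to catch the error per call and record a missing value instead. The sketch below assumes only that the evaluator raises `ValueError` on an unparseable verdict, as the traceback shows; `safe_evaluate` is a hypothetical helper, and the two stand-in functions exist purely so the example runs without an LLM.

```python
def safe_evaluate(evaluate, **kwargs):
    """Call an evaluator function and return (score, reasoning).

    Returns (None, error message) when the evaluator raises ValueError, so a
    single unparseable verdict does not stop the whole loop.
    """
    try:
        result = evaluate(**kwargs)
    except ValueError as err:
        return None, f"evaluation failed: {err}"
    return result.get("score"), result.get("reasoning")


# Stand-ins for evaluator.evaluate_strings, used only for illustration.
def fake_ok(**kwargs):
    return {"score": 7, "reasoning": "Rating: [[7]]"}


def fake_bad(**kwargs):
    raise ValueError("Invalid output")


print(safe_evaluate(fake_ok, prediction="a", reference="b", input="c"))
print(safe_evaluate(fake_bad, prediction="a", reference="b", input="c"))
```

In the notebook this would be called as `safe_evaluate(evaluator.evaluate_strings, prediction=..., reference=..., input=...)`, appending the returned score and reasoning on every iteration.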

In [10]:
#create a copy of sample qa and append default score and explanation columns
df_default_scores = sample_qa.copy()

#append scores
df_default_scores['rag_on_scores_default'] = rag_on_scores
df_default_scores['rag_off_scores_default'] = rag_off_scores

#append explanations
df_default_scores['rag_on_explanations_default'] = rag_on_exp
df_default_scores['rag_off_explanations_default'] = rag_off_exp

ValueError: Length of values (8) does not match length of index (10)

In [None]:
df_default_scores

Unnamed: 0.1,Unnamed: 0,question,answer,reference,short_reference,rag_off,rag_on,llm_reference,rag_on_scores_default,rag_off_scores_default,rag_on_explanations_default,rag_off_explanations_default
5953,5953,When should the Court of Protection be involve...,The Court of Protection must be involved in de...,https://www.nhs.uk/conditions/consent-to-treat...,consent-to-treatment,\n- The Court of Protection is a specialist co...,The Court of Protection should be involved in ...,(docs\do-not-attempt-cardiopulmonary-resuscita...,4,7,\nRating: [[4]]\n\nThe assistant provided a mo...,\nThe response provides a helpful overview of ...
7787,7787,What should I do if I'm worried about someone ...,Encourage them to make an appointment with a G...,https://www.nhs.uk/conditions/dementia-with-le...,dementia-with-lewy-bodies,\n- Educate yourself about dementia with Lewy ...,If you're worried about someone else who may h...,(dementia-with-lewy-bodies.txt),8,8,\nRating: [[8]]\n\nThe assistant provides a he...,\nThe response from the AI assistant is helpfu...
16298,16298,"What is radiotherapy, and how is it used to tr...",Radiotherapy involves using low doses of radia...,https://www.nhs.uk/conditions/non-melanoma-ski...,non-melanoma-skin-cancer,Radiotherapy (or radiation therapy) utilizes h...,Radiotherapy is a treatment where radiation is...,"(radiotherapy.txt, non-melanoma-skin-cancer.txt)",8,8,\nRating: [[8]]\n\nI have given the submission...,\nThe submission provides a helpful and releva...
6371,6371,What should you avoid after a cornea transplan...,"During the first weeks after surgery, avoid ru...",https://www.nhs.uk/conditions/cornea-transplan...,cornea-transplant,\nHere are some of the main things to avoid af...,"After a cornea transplant surgery, you should ...",(cornea-transplant.txt),7,9,\nExplanation: The assistant gives a helpful o...,"\nThe response provides helpful, relevant, and..."
18995,18995,What is pudendal neuralgia?,Pudendal neuralgia is a long-term pelvic pain ...,https://www.nhs.uk/conditions/pudendal-neuralgia/,pudendal-neuralgia,\nPudendal neuralgia is a painful condition af...,Pudendal neuralgia is nerve pain in the genita...,(pudendal-neuralgia.txt),6,9,\n[rating]] [rating]] [[6]]\n\nThe assistant p...,\n[Expert's evaluation]\nThe assistant provide...
5962,5962,Who can give consent for a child's medical tre...,Someone with parental responsibility can conse...,https://www.nhs.uk/conditions/consent-to-treat...,consent-to-treatment,\n- If a child is unable to consent to their o...,If a child under the age of 16 is unable to gi...,(consent-to-treatment.txt),8,4,\nExplanation: The assistant's response correc...,\nRating: [[4]]\n\nThe assistant gives a gener...
18961,18961,What kinds of mental health information does t...,The Royal College of Psychiatrists (RCPsych) h...,https://www.nhs.uk/conditions/psychiatry/,psychiatry,\nThe Royal College of Psychiatrists is a prof...,Unfortunately there is no clear answer in the ...,"(rett-syndrome.txt, frontotemporal-dementia.tx...",2,8,\nThe assistant's response does not actually a...,\nThe assistant's answer is relevant to the qu...
23880,23880,Are there any specific symptoms that may resul...,"Yes, if you experience symptoms such as bleedi...",https://www.nhs.uk/conditions/vaginal-cancer/s...,vaginal-cancer,\nIf a woman has any of the following symptoms...,"According to the documents, some of the sympto...","(vaginal-cancer.txt, vulval-cancer.txt, cervic...",8,8,\nThe assistant's response accurately referenc...,\nExplanation: The assistant's response outlin...
11414,11414,When should I call 111 or get an urgent GP app...,You should get an urgent GP appointment or cal...,https://www.nhs.uk/conditions/headache/,headache,\nCall 111 or get an urgent GP appointment for...,Get an urgent GP appointment or call 111 if: -...,"(headaches.txt, cluster-headaches.txt, double-...",9,8,"\nThe AI assistant provides a helpful, compreh...",\nExplanation: The AI assistant provided a com...
20159,20159,What can be done at home to manage sickle cell...,"If you have a sickle cell crisis, you can usua...",https://www.nhs.uk/conditions/sickle-cell-dise...,sickle-cell-disease,"\n1. Drink plenty of fluids like water, non-ca...","To help manage sickle cell crisis at home, the...",(sickle-cell-disease.txt),10,8,\nThe quality of the response is excellent. It...,\nThe main parts of the answer covered helpful...


<h3>1.2 - Separate Scoring for Each Criterion</h3>
Scoring the responses from the LLM with the ground truth when RAG is turned on and off providing a separate score for custom criteria

In [19]:
# This is equivalent to loading using the enum
from langchain.evaluation import EvaluatorType

In [21]:
from langchain.prompts import PromptTemplate

fstring = """Respond Y or N based on how well the following response follows the specified rubric. Grade only based on the rubric and expected response:

Grading Rubric: {criteria}
Expected Response: {reference}

DATA:
---------
Question: {input}
Response: {output}
---------
Please act as an impartial judge
and evaluate the quality of the response provided by an AI 
assistant to the user question displayed below based on the following criteria: {criteria}. Begin your evaluation 
by providing a short explanation. Be as objective as possible. 
After providing your explanation, you must rate the response on a scale of 1 to 10 
by strictly following this format: "[[rating]]", for example: "Rating: [[5]]".\n\n\
[Question]\n{input}\n\n[The Start of Assistant\'s Answer]\n{prediction}\n"""

prompt = PromptTemplate.from_template(fstring)

evaluator = load_evaluator("labeled_criteria", criteria="correctness", prompt=prompt, llm = rag_pipeline.llm)

ValueError: Input variables should be {'reference', 'output', 'input', 'criteria'}, but got ['criteria', 'input', 'output', 'prediction', 'reference']
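The error above says the prompt may only use `criteria`, `input`, `output` and `reference`, but the template also contains `{prediction}`. The second half of the template looks like a separate 1-to-10 scoring prompt pasted after the Y/N prompt. A minimal sketch of checking the placeholders before building the evaluator follows; keeping only the first half of the template is an assumption about what was intended, and `template_variables` is a hypothetical helper built on the standard library.

```python
from string import Formatter

# Variable names taken from the ValueError message above.
EXPECTED = {"criteria", "input", "output", "reference"}


def template_variables(template):
    """Return the set of {placeholder} names used in a format string."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


# First half of the original template only (assumed to be the intended prompt).
fixed_template = """Respond Y or N based on how well the following response follows the specified rubric. Grade only based on the rubric and expected response:

Grading Rubric: {criteria}
Expected Response: {reference}

DATA:
---------
Question: {input}
Response: {output}
---------
"""

unexpected = template_variables(fixed_template) - EXPECTED
missing = EXPECTED - template_variables(fixed_template)
print("unexpected:", unexpected, "missing:", missing)
```

If both sets are empty, the template uses exactly the variables named in the error message.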

In [88]:
from langchain.evaluation import Criteria
list_of_criteria = list(Criteria)
print(list_of_criteria)

[<Criteria.CONCISENESS: 'conciseness'>, <Criteria.RELEVANCE: 'relevance'>, <Criteria.CORRECTNESS: 'correctness'>, <Criteria.COHERENCE: 'coherence'>, <Criteria.HARMFULNESS: 'harmfulness'>, <Criteria.MALICIOUSNESS: 'maliciousness'>, <Criteria.HELPFULNESS: 'helpfulness'>, <Criteria.CONTROVERSIALITY: 'controversiality'>, <Criteria.MISOGYNY: 'misogyny'>, <Criteria.CRIMINALITY: 'criminality'>, <Criteria.INSENSITIVITY: 'insensitivity'>, <Criteria.DEPTH: 'depth'>, <Criteria.CREATIVITY: 'creativity'>, <Criteria.DETAIL: 'detail'>]


In [98]:
#initialise criteria dictionary to hold all the evaluation scores
criteria_dict = {'question':[]}
for i in list_of_criteria:
    criteria_dict[i.value] = []

<h3>Running evaluation on responses from LLM with RAG on</h3>

In [99]:
#run the evaluation for each criterion and append each score to the dictionary
#(the 'question' key is skipped here; the questions are added in a later cell)
for i in list(criteria_dict.keys())[1:]:
    print('Now evaluating the response based on: {}'.format(i))
    #the evaluator depends only on the criterion, so load it once per criterion
    evaluator_crit = load_evaluator("labeled_criteria", criteria=i, llm=rag_pipeline.llm)
    for idx in range(sample_qa.shape[0]):
        eval_result_rag_on_crit = evaluator_crit.evaluate_strings(
                input = sample_qa.iloc[idx]['question'],
                prediction= sample_qa.iloc[idx]['rag_on'],
                reference= sample_qa.iloc[idx]['answer'])
        #add the corresponding score for rag_on
        criteria_dict[i].append(eval_result_rag_on_crit['score'])
        print('The LLM scored: {} for question {}'.format(eval_result_rag_on_crit['score'], idx+1))

Now evaluating the response based on: conciseness
The LLM scored: 1 for question 1
The LLM scored: 0 for question 2
The LLM scored: 0 for question 3
The LLM scored: 1 for question 4
The LLM scored: 0 for question 5
The LLM scored: 1 for question 6
The LLM scored: 0 for question 7
The LLM scored: 1 for question 8
The LLM scored: 1 for question 9
The LLM scored: 1 for question 10
Now evaluating the response based on: relevance
The LLM scored: 0 for question 1
The LLM scored: 1 for question 2
The LLM scored: 0 for question 3
The LLM scored: 1 for question 4
The LLM scored: 0 for question 5
The LLM scored: 1 for question 6
The LLM scored: 0 for question 7
The LLM scored: None for question 8
The LLM scored: 1 for question 9
The LLM scored: 0 for question 10
Now evaluating the response based on: correctness
The LLM scored: None for question 1
The LLM scored: 1 for question 2
The LLM scored: 1 for question 3
The LLM scored: 0 for question 4
The LLM scored: None for question 5
The LLM scored: 

In [104]:
criteria_dict

{'question': [],
 'conciseness': [1, 0, 0, 1, 0, 1, 0, 1, 1, 1],
 'relevance': [0, 1, 0, 1, 0, 1, 0, None, 1, 0],
 'correctness': [None, 1, 1, 0, None, None, 0, 1, 1, 1],
 'coherence': [1, 1, None, 1, 1, 1, 1, 1, 0, 0],
 'harmfulness': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'maliciousness': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'helpfulness': [1, 1, 1, 1, 1, 1, 0, 1, 1, 0],
 'controversiality': [0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
 'misogyny': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'criminality': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'insensitivity': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'depth': [0, 1, 1, 0, 1, 0, 0, 1, None, 0],
 'creativity': [0, 0, 1, 0, 0, 0, 0, 0, 1, 0],
 'detail': [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]}

In [105]:
for i in sample_qa['question']:
    criteria_dict['question'].append(i)

In [108]:
df_criteria_scores_rag_on = pd.DataFrame(criteria_dict)
df_criteria_scores_rag_on

Unnamed: 0,question,conciseness,relevance,correctness,coherence,harmfulness,maliciousness,helpfulness,controversiality,misogyny,criminality,insensitivity,depth,creativity,detail
0,When should the Court of Protection be involve...,1,0.0,,1.0,0,0,1,0,0,0,0,0.0,0,0
1,What should I do if I'm worried about someone ...,0,1.0,1.0,1.0,0,0,1,0,0,0,0,1.0,0,1
2,"What is radiotherapy, and how is it used to tr...",0,0.0,1.0,,0,0,1,0,0,0,0,1.0,1,1
3,What should you avoid after a cornea transplan...,1,1.0,0.0,1.0,0,0,1,0,0,0,0,0.0,0,0
4,What is pudendal neuralgia?,0,0.0,,1.0,0,0,1,0,0,0,0,1.0,0,1
5,Who can give consent for a child's medical tre...,1,1.0,,1.0,0,0,1,0,0,0,0,0.0,0,0
6,What kinds of mental health information does t...,0,0.0,0.0,1.0,0,0,0,0,0,0,0,0.0,0,0
7,Are there any specific symptoms that may resul...,1,,1.0,1.0,0,0,1,0,0,0,0,1.0,0,1
8,When should I call 111 or get an urgent GP app...,1,1.0,1.0,0.0,0,0,1,1,0,0,0,,1,1
9,What can be done at home to manage sickle cell...,1,0.0,1.0,0.0,0,0,0,0,0,0,0,0.0,0,0


<h3>Running evaluation on responses from LLM with RAG off</h3>

In [110]:
#initialise criteria dictionary to hold all the evaluation scores
criteria_dict_rag_off = {'question':[]}
for i in list_of_criteria:
    criteria_dict_rag_off[i.value] = []

In [111]:
#run the evaluation for each criterion and append each score to the dictionary
#(the 'question' key is skipped here; the questions are added in a later cell)
for i in list(criteria_dict_rag_off.keys())[1:]:
    print('Now evaluating the response based on: {}'.format(i))
    #the evaluator depends only on the criterion, so load it once per criterion
    evaluator_crit = load_evaluator("labeled_criteria", criteria=i, llm=rag_pipeline.llm)
    for idx in range(sample_qa.shape[0]):
        eval_result_rag_off_crit = evaluator_crit.evaluate_strings(
                input = sample_qa.iloc[idx]['question'],
                prediction= sample_qa.iloc[idx]['rag_off'],
                reference= sample_qa.iloc[idx]['answer'])
        #add the corresponding score for rag_off
        criteria_dict_rag_off[i].append(eval_result_rag_off_crit['score'])
        print('The LLM scored: {} for question {}'.format(eval_result_rag_off_crit['score'], idx+1))

Now evaluating the response based on: conciseness
The LLM scored: 1 for question 1
The LLM scored: 0 for question 2
The LLM scored: 0 for question 3
The LLM scored: 1 for question 4
The LLM scored: 0 for question 5
The LLM scored: 1 for question 6
The LLM scored: 1 for question 7
The LLM scored: 0 for question 8
The LLM scored: None for question 9
The LLM scored: 0 for question 10
Now evaluating the response based on: relevance
The LLM scored: 0 for question 1
The LLM scored: 0 for question 2
The LLM scored: 0 for question 3
The LLM scored: 0 for question 4
The LLM scored: 0 for question 5
The LLM scored: 0 for question 6
The LLM scored: 0 for question 7
The LLM scored: 1 for question 8
The LLM scored: 0 for question 9
The LLM scored: 0 for question 10
Now evaluating the response based on: correctness
The LLM scored: 1 for question 1
The LLM scored: 1 for question 2
The LLM scored: 1 for question 3
The LLM scored: 1 for question 4
The LLM scored: 1 for question 5
The LLM scored: 1 for 

In [113]:
for i in sample_qa['question']:
    criteria_dict_rag_off['question'].append(i)

In [114]:
df_criteria_scores_rag_off = pd.DataFrame(criteria_dict_rag_off)
df_criteria_scores_rag_off

Unnamed: 0,question,conciseness,relevance,correctness,coherence,harmfulness,maliciousness,helpfulness,controversiality,misogyny,criminality,insensitivity,depth,creativity,detail
0,When should the Court of Protection be involve...,1.0,0,1,1,0,0,0,0,0,0,0,0.0,0,1.0
1,What should I do if I'm worried about someone ...,0.0,0,1,1,0,0,1,0,0,0,0,1.0,1,1.0
2,"What is radiotherapy, and how is it used to tr...",0.0,0,1,1,0,0,1,0,0,0,0,1.0,1,1.0
3,What should you avoid after a cornea transplan...,1.0,0,1,1,0,0,1,0,0,0,0,1.0,0,0.0
4,What is pudendal neuralgia?,0.0,0,1,1,0,0,1,0,0,0,0,,0,
5,Who can give consent for a child's medical tre...,1.0,0,1,1,0,0,1,0,0,0,0,1.0,0,0.0
6,What kinds of mental health information does t...,1.0,0,1,1,0,0,1,0,0,0,0,1.0,0,1.0
7,Are there any specific symptoms that may resul...,0.0,1,1,1,0,0,1,0,0,0,0,1.0,0,1.0
8,When should I call 111 or get an urgent GP app...,,0,0,1,0,0,1,0,0,0,0,0.0,1,0.0
9,What can be done at home to manage sickle cell...,0.0,0,1,1,0,0,1,0,0,0,0,1.0,0,1.0
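One way to compare the two tables is to average each criterion while skipping the questions where the judge returned no score. The sketch below does this for two criteria only, with the values copied from the RAG-on dictionary and the RAG-off table shown above; `mean_ignoring_none` and the two dictionary names are hypothetical.

```python
def mean_ignoring_none(scores):
    """Average the scores that are present; return None if none are."""
    valid = [s for s in scores if s is not None]
    return sum(valid) / len(valid) if valid else None


# Values copied from the outputs above (two criteria only, for brevity).
scores_rag_on = {
    "conciseness": [1, 0, 0, 1, 0, 1, 0, 1, 1, 1],
    "correctness": [None, 1, 1, 0, None, None, 0, 1, 1, 1],
}
scores_rag_off = {
    "conciseness": [1, 0, 0, 1, 0, 1, 1, 0, None, 0],
    "correctness": [1, 1, 1, 1, 1, 1, 1, 1, 0, 1],
}

for criterion in scores_rag_on:
    on = mean_ignoring_none(scores_rag_on[criterion])
    off = mean_ignoring_none(scores_rag_off[criterion])
    print(f"{criterion}: RAG on = {on:.2f}, RAG off = {off:.2f}")
```

Skipping missing scores means each mean can rest on a different number of questions, so the count of valid scores is worth reporting alongside it.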
