# RAG

This work looks at the implementation of retrieval-augmented generation (RAG) within NHS England. This notebook contains a simple RAG pipeline that can run with RAG turned on, or with RAG turned off (relying only on the model's innate "knowledge").

## Setup

In [22]:
import glob
import os
import pandas as pd
import random


import toml
from dotenv import load_dotenv

import src.models as models

from tqdm import tqdm

config = toml.load("config.toml")
load_dotenv(".secrets")
os.environ["ANTHROPIC_API_KEY"] = os.getenv("anthropic_key")

if config['DEV_MODE']:
    config['PERSIST_DIRECTORY'] += "/dev"
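One detail of the key handling above: `os.getenv` returns `None` when a variable is unset, and assigning `None` into `os.environ` raises a `TypeError`. The following is a minimal sketch of a guard that fails with a clearer message. `require_secret` is a hypothetical helper (it is not part of `src.models`), and it takes the environment as a plain mapping so it can be tried without touching real secrets.

```python
from typing import Mapping


def require_secret(name: str, env: Mapping[str, str]) -> str:
    """Return env[name], or raise a readable error if it is missing or empty."""
    value = env.get(name)
    if not value:
        raise RuntimeError(f"'{name}' is not set; check the .secrets file")
    return value


# Demonstration with a stand-in mapping rather than the real os.environ.
fake_env = {"anthropic_key": "dummy-value"}
print(require_secret("anthropic_key", fake_env))
```

In the notebook this could be called as `require_secret("anthropic_key", os.environ)` after `load_dotenv(".secrets")`.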


First we initialise the RAG pipeline. This is an object which links the vector store and the LLM: when you pass in a query, it is passed to the database, and the pipeline then returns the response.

There are also methods for adding documents to the database.
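To make the idea concrete, here is a toy sketch of such an object. It is not the code behind `models.RagPipeline`, whose internals are not shown in this notebook: `ToyRagPipeline`, its word-overlap retrieval and the `echo_llm` stand-in are all assumptions made for illustration.

```python
import re
from typing import Callable, List


def tokenize(text: str) -> set:
    # Lower-case word tokens, ignoring punctuation.
    return set(re.findall(r"\w+", text.lower()))


class ToyRagPipeline:
    """Hypothetical stand-in: a small document store plus an LLM callable."""

    def __init__(self, llm: Callable[[str], str]) -> None:
        self.llm = llm
        self.documents: List[str] = []

    def add_documents(self, docs: List[str]) -> None:
        self.documents.extend(docs)

    def retrieve(self, query: str, k: int = 1) -> List[str]:
        # Toy retrieval: rank documents by the number of words shared with the query.
        query_words = tokenize(query)
        ranked = sorted(
            self.documents,
            key=lambda doc: len(query_words & tokenize(doc)),
            reverse=True,
        )
        return ranked[:k]

    def answer_question(self, question: str, rag: bool = True) -> str:
        # With rag=False the LLM receives the question without any retrieved context.
        context = "\n".join(self.retrieve(question)) if rag else ""
        return self.llm(f"QUESTION: {question}\n=========\n{context}\n=========\nFINAL ANSWER:")


def echo_llm(prompt: str) -> str:
    # Stand-in "LLM" that returns the prompt, so the retrieved context is visible.
    return prompt


pipeline = ToyRagPipeline(echo_llm)
pipeline.add_documents(["Put the wheelchair brakes on.", "Drink plenty of water."])
print(pipeline.answer_question("What about a seizure in a wheelchair?", rag=True))
```

A real pipeline would replace the word-overlap ranking with a similarity search over embeddings in the vector store.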

In [23]:
rag_pipeline = models.RagPipeline(config['EMBEDDING_MODEL'], config['PERSIST_DIRECTORY'])



We need to fill the database if it's empty (this might take 5 minutes or so the first time, unless you've got a nice graphics card!).

In [24]:
# Add documents if there are none. In DEV mode, don't add any more unless the store is empty.
if len(rag_pipeline.vectorstore.get()['documents']) == 0 or (not config['DEV_MODE']):
    rag_pipeline.load_documents()  

## Generating Response from Cogstack Questions

<h3> Load in Cogstack QA from Github Repo </h3>

Link to the Cogstack QA data: https://raw.githubusercontent.com/CogStack/OpenGPT/main/data/nhs_uk_full/prepared_generated_data_for_nhs_uk_qa.csv

In [25]:
#load processed questions and answers
cogstack_qa = pd.read_csv('src/model_eval/cogstack_qa_data_process.csv')

#select a random sample question
sample_qa = cogstack_qa.sample(n = 1, random_state = 999)


In [26]:
# Print out the question, answer and reference from Cogstack
print('Question: {}'.format(sample_qa['question'].values[0]))
print('\n')
print('Answer: {}'.format(sample_qa['answer'].values[0]))
print('\n')
print('Reference: {}'.format(sample_qa['reference'].values[0]))

Question: What can I do if someone with epilepsy has a seizure while in a wheelchair?


Answer: If the person is in a wheelchair during a seizure, put the brakes on and leave any seatbelt or harness on. Support them gently and cushion their head, but do not try to move them.


Reference: https://www.nhs.uk/conditions/what-to-do-if-someone-has-a-seizure-fit/


<h3>Generate a response with the LLM with RAG turned off</h3>

In [27]:
#here is the prompt given to the llm...
"""Given the following extracted parts of a long document and a question, create a final answer with references ("SOURCES"). \
If you don't know the answer, just say that you don't know. Don't try to make up an answer. \
ALWAYS return a "SOURCES" part in your answer.

Example 1: "**RAP** is to be the foundation of analyst training. SOURCES: (goldacre_review.txt)"
Example 2: "Open source code is a good idea because:
* it's cheap (goldacre_review.txt)
* it's easy for people to access and use (open_source_guidlines.txt)
* it's easy to share (goldacre_review.txt)

SOURCES: (goldacre_review.txt, open_source_guidlines.txt)"

QUESTION: {question}
=========
{docs}
=========
FINAL ANSWER:"""
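The `{question}` and `{docs}` placeholders in the prompt above are filled in before the text is sent to the LLM. A minimal sketch of that substitution follows, using plain `str.format`. The shortened `TEMPLATE`, the `build_prompt` helper and the example chunk are assumptions for illustration; the pipeline's own templating code is not shown in this notebook.

```python
from typing import List

# Shortened stand-in for the prompt above; only the two placeholders are the same.
TEMPLATE = (
    "Answer the question using the extracts below.\n\n"
    "QUESTION: {question}\n"
    "=========\n"
    "{docs}\n"
    "=========\n"
    "FINAL ANSWER:"
)


def build_prompt(question: str, chunks: List[str]) -> str:
    """Join the retrieved chunks and substitute both placeholders."""
    return TEMPLATE.format(question=question, docs="\n\n".join(chunks))


prompt = build_prompt(
    "What can I do if someone with epilepsy has a seizure while in a wheelchair?",
    ["(hypothetical chunk) Put the brakes on and leave any seatbelt on."],
)
print(prompt)
```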



In [28]:
question = sample_qa['question'].values[0]

result_rag_off = rag_pipeline.answer_question(question, rag=False)

print(result_rag_off)


- Stay calm and track how long the seizure lasts. Protect the person from injury, but don't restrain or put anything in their mouth.

- Make sure the wheelchair is locked so it doesn't move. Gently support the person's head and body if they start to fall to one side. Remove glasses and loosen any tight clothing around their neck.

- Clear any objects around them to prevent injury. Pad sharp edges on nearby furniture. Move chairs/objects so there's open space around the wheelchair.

- Ease them onto the floor if the wheelchair allows it. This helps avoid falling from the chair. Protect their head as you lower them.

- Cushion their head with a folded jacket or blanket once on the floor. Don't put anything in their mouth or restrain them, just protect from injury. 

- After the seizure, turn them on their side to keep their airway clear in case they vomit. Reassure them when they regain consciousness and remain with them until fully recovered.

The main priorities are to protect them du

Now we will run with **RAG** turned on. The output is long because the chain is set to be verbose: it prints the completed prompt submitted to the LLM, including the document chunks it found, followed by the answer.

In [29]:
result_rag_on = rag_pipeline.answer_question(question, rag=True)

print(result_rag_on)



> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
You are a helpful assistant that helps people with their questions. You are not a replacement for human judgement, but you can help humansmake more informed decisions. If you are asked a question you cannot answer based on your following instructions, you should say so.Be concise and professional in your responses.

 Given the following extracted parts of a long document and a question, create a final answer with references ("SOURCES"). If you don't know the answer, just say that you don't know. Don't try to make up an answer. ALWAYS return a "SOURCES" part in your answer.

Example 1: "**RAP** is to be the foundation of analyst training. SOURCES: (goldacre_review.txt)"
Example 2: "Open source code is a good idea because:
* it's cheap (goldacre_review.txt)
* it's easy for people to access and use (open_source_guidlines.txt)
* it's easy to share (goldacre

<h1>Evaluating the responses</h1>

<h3>1. Using the Langchain Scoring Evaluator with Default Criteria on the LLM Response with RAG Turned Off</h3>

The scoring evaluator module in langchain uses a set of criteria to judge the response from the LLM, comparing it with the reference as the ground truth.
With the default criteria, the evaluation considers helpfulness, relevance, correctness and depth. The LLM (the same model used for generation)
outputs a score between 1 and 10, along with a short explanation of why that score was given.


In [30]:
#load in default langchain scoring evaluator
from langchain.evaluation import load_evaluator

evaluator = load_evaluator("labeled_score_string", llm=rag_pipeline.llm)

In [31]:
#The prompt given to the LLM for evaluation is as follows...
'''[Instruction]\nPlease act as an impartial judge \
and evaluate the quality of the response provided by an AI \
assistant to the user question displayed below. {criteria}Begin your evaluation \
by providing a short explanation. Be as objective as possible. \
After providing your explanation, you must rate the response on a scale of 1 to 10 \
by strictly following this format: "[[rating]]", for example: "Rating: [[5]]".\n\n\
[Question]\n{input}\n\n[The Start of Assistant\'s Answer]\n{prediction}\n\
[The End of Assistant\'s Answer]'''

'[Instruction]\nPlease act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. {criteria}Begin your evaluation by providing a short explanation. Be as objective as possible. After providing your explanation, you must rate the response on a scale of 1 to 10 by strictly following this format: "[[rating]]", for example: "Rating: [[5]]".\n\n[Question]\n{input}\n\n[The Start of Assistant\'s Answer]\n{prediction}\n[The End of Assistant\'s Answer]'
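The prompt above asks the judge to give its score in the form `[[rating]]`, which makes the number easy to pull out of free text. Langchain's evaluator does this parsing itself; the snippet below is only an illustration of the idea, and `parse_rating` is a hypothetical helper rather than part of the library.

```python
import re
from typing import Optional


def parse_rating(text: str) -> Optional[int]:
    """Return the first [[n]] rating found in text if it is between 1 and 10, else None."""
    match = re.search(r"\[\[(\d+)\]\]", text)
    if match is None:
        return None
    rating = int(match.group(1))
    return rating if 1 <= rating <= 10 else None


print(parse_rating("The response covers the main points. Rating: [[7]]"))
```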

In [32]:
# Evaluate the rag_off response against the ground truth and record the score and reasoning

eval_result_rag_off = evaluator.evaluate_strings(
    prediction=result_rag_off,
    reference=sample_qa['answer'].values[0],
    input=question)

print('RAG off score: {}'.format(eval_result_rag_off['score']))

    

RAG off score: 7


In [33]:
#print reasoning provided by the LLM
print('Reasoning: {}'.format(eval_result_rag_off['reasoning']))

Reasoning: 
Explanation: The assistant's response covers several helpful points regarding how to assist someone having a seizure in a wheelchair, including staying calm, making sure the wheelchair is locked, gently supporting the person, clearing space around them, easing them onto the floor if possible, cushioning their head, not restraining them or putting anything in their mouth, turning them on their side after the seizure, and providing reassurance when they regain consciousness. The response demonstrates good depth and relevance by addressing the key aspects of the question.

However, the response does not specifically mention leaving any seatbelt or harness on or supporting the person's head gently, as stated in the ground truth. It also does not provide a strict verbatim quote from the ground truth text. As such, while helpful and thoughtful, it lacks some correctness and relevance compared to the ground truth.

Rating: [[7]]

The response covers the main points well and provid

<h3>2. Using the Langchain Scoring Evaluator with Default Criteria on the LLM Response with RAG Turned On</h3>

In [34]:
# Evaluate the rag_on response against the ground truth and record the score and reasoning

eval_result_rag_on = evaluator.evaluate_strings(
    prediction=result_rag_on,
    reference=sample_qa['answer'].values[0],
    input=question)

print('RAG on score: {}'.format(eval_result_rag_on['score']))


RAG on score: 10


In [35]:
#print reasoning provided by the LLM
print('Reasoning: {}'.format(eval_result_rag_on['reasoning']))

Reasoning: 
The assistant's response accurately summarizes the key steps to take if someone has a seizure while in a wheelchair, as outlined in the ground truth. It helps by clearly explaining what actions to take, including putting on the brakes, leaving on any restraints, gently supporting them, and cushioning their head. The response is fully relevant, referring directly to the question that was asked. It is also factually correct based on the source text provided. Overall, it demonstrates helpfulness, relevance, correctness, and depth.

Rating: [[10]]


<h3> 3. Check if references match </h3>

In [36]:
# Check if the LLM quoted one of the references used in the Cogstack response
tokens = result_rag_on.split()
idx = tokens.index('SOURCES:')
sources = []
for token in tokens[idx + 1:]:
    # Strip brackets, commas and the .txt extension to leave the bare source name
    for unwanted in ('(', ')', '.txt', ','):
        token = token.replace(unwanted, '')
    sources.append(token)

In [37]:
for i in sources:
    if i in sample_qa['reference'].values[0]:
        print('References match: {}'.format(i))

References match: what-to-do-if-someone-has-a-seizure-fit


In [38]:
sample_qa['reference'].values[0]

'https://www.nhs.uk/conditions/what-to-do-if-someone-has-a-seizure-fit/'
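The reference check above can be wrapped into reusable functions. The sketch below is one possible way to do it, under two assumptions that differ from the cells above: it reads the text after the last `SOURCES:` marker rather than the first, and it returns an empty list instead of raising when the marker is missing. `extract_sources` and `reference_matches` are hypothetical names, and the example answer is made up for illustration.

```python
import re
from typing import List


def extract_sources(answer: str) -> List[str]:
    """Return the source names listed after the last 'SOURCES:' marker, without '.txt'."""
    _, marker, tail = answer.rpartition("SOURCES:")
    if not marker:
        return []  # the answer did not include a SOURCES section
    names = []
    # Tokens are runs of characters that are not whitespace, brackets or commas.
    for token in re.findall(r"[^\s(),]+", tail):
        names.append(token[:-4] if token.endswith(".txt") else token)
    return names


def reference_matches(answer: str, reference_url: str) -> bool:
    """True if any quoted source name appears in the reference URL."""
    return any(name in reference_url for name in extract_sources(answer))


example_answer = (
    "Put the brakes on and cushion their head. "
    "SOURCES: (what-to-do-if-someone-has-a-seizure-fit.txt, epilepsy.txt)"
)
url = "https://www.nhs.uk/conditions/what-to-do-if-someone-has-a-seizure-fit/"
print(extract_sources(example_answer))
print(reference_matches(example_answer, url))
```

Because `reference_matches` uses `any`, each answer counts at most once, however many of its sources match.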

<h3> 4. Using HyDE </h3>
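HyDE (Hypothetical Document Embeddings) changes the retrieval step: instead of searching the vector store with the question itself, the LLM first writes a hypothetical answer, and that answer is embedded and used as the search query. How `hyde=True` is implemented inside `rag_pipeline.answer_question` is not shown in this notebook, so the following is only a toy sketch of the general technique. The bag-of-words `embed`, the canned `fake_generate` and the two example documents are all assumptions for illustration.

```python
import math
import re
from collections import Counter
from typing import Callable, List


def embed(text: str) -> Counter:
    # Toy embedding: bag-of-words counts (a real pipeline would use an embedding model).
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query_text: str, docs: List[str]) -> str:
    # Return the document whose embedding is closest to the query embedding.
    query_vec = embed(query_text)
    return max(docs, key=lambda doc: cosine(query_vec, embed(doc)))


def hyde_retrieve(question: str, docs: List[str], generate: Callable[[str], str]) -> str:
    # HyDE: search with a generated hypothetical answer rather than the question.
    hypothetical_answer = generate(question)
    return retrieve(hypothetical_answer, docs)


def fake_generate(question: str) -> str:
    # Stand-in for an LLM call; returns a canned hypothetical answer.
    return "Apply the brakes and cushion the head."


docs = [
    "Apply the brakes, leave the harness on and cushion the head.",
    "What is a seizure? A seizure is a burst of electrical activity.",
]
question = "What can I do if someone has a seizure in a wheelchair?"
print(retrieve(question, docs))
print(hyde_retrieve(question, docs, fake_generate))
```

In this toy setup the question shares words with the definitional document, while the hypothetical answer shares words with the guidance document, so the two searches pick different documents.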

In [39]:
result_hyde_on = rag_pipeline.answer_question(question, rag=True, hyde=True)

print(result_hyde_on)



> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
You are a helpful assistant that helps people with their questions. You are not a replacement for human judgement, but you can help humansmake more informed decisions. If you are asked a question you cannot answer based on your following instructions, you should say so.Be concise and professional in your responses.

 Given the following extracted parts of a long document and a question, create a final answer with references ("SOURCES"). If you don't know the answer, just say that you don't know. Don't try to make up an answer. ALWAYS return a "SOURCES" part in your answer.

Example 1: "**RAP** is to be the foundation of analyst training. SOURCES: (goldacre_review.txt)"
Example 2: "Open source code is a good idea because:
* it's cheap (goldacre_review.txt)
* it's easy for people to access and use (open_source_guidlines.txt)
* it's easy to share (goldacre

In [40]:
source_scores = {}

# LLMs are non-deterministic, so we test HyDE multiple times on the same test set.
for sample in range(5):

    # Sample 50 questions
    sample_size = 50
    sample_questions = cogstack_qa.sample(n = sample_size, random_state = 1000)

    score_hyde_on  = 0
    score_hyde_off = 0

    for index,row in enumerate(sample_questions.itertuples()):
        
        question = row.question
        
        result_hyde_off = rag_pipeline.answer_question(question, rag=True, hyde=False)
        result_hyde_on  = rag_pipeline.answer_question(question, rag=True, hyde=True)

        for result, hyde_on in zip([result_hyde_off,result_hyde_on],[False,True]):
        
            # Check if the LLM quoted one of the references used in the cogstack response
            try: 
                idx  = result.split().index('SOURCES:')
                # Occasionally outputs do not include any sources, so we need to handle these errors.
            
                sources = []

                for i in result.split()[idx + 1:]:
                    j = i.replace('(', '')
                    j = j.replace(')', '')
                    j = j.replace('.txt', '')
                    j = j.replace(',', '')
                    sources.append(j)

                for i in sources:
                    if i in row.reference:
                        if hyde_on:
                            score_hyde_on += 1
                        else:
                            score_hyde_off += 1
                
            except ValueError as e:
                print(e)
                print(result)

    source_scores[f"sample_{sample}_hyde_on"] = score_hyde_on
    source_scores[f"sample_{sample}_hyde_off"] = score_hyde_off

    print(f"\nSample {sample}")
    print(f"Correct references with HyDE: {score_hyde_on} / {sample_size}")
    print(f"Correct references without HyDE: {score_hyde_off} / {sample_size}")







> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
You are a helpful assistant that helps people with their questions. You are not a replacement for human judgement, but you can help humansmake more informed decisions. If you are asked a question you cannot answer based on your following instructions, you should say so.Be concise and professional in your responses.

 Given the following extracted parts of a long document and a question, create a final answer with references ("SOURCES"). If you don't know the answer, just say that you don't know. Don't try to make up an answer. ALWAYS return a "SOURCES" part in your answer.

Example 1: "**RAP** is to be the foundation of analyst training. SOURCES: (goldacre_review.txt)"
Example 2: "Open source code is a good idea because:
* it's cheap (goldacre_review.txt)
* it's easy for people to access and use (open_source_guidlines.txt)
* it's easy to share (goldacre

In [41]:
source_scores

{'sample_0_hyde_on': 38,
 'sample_0_hyde_off': 29,
 'sample_1_hyde_on': 30,
 'sample_1_hyde_off': 28,
 'sample_2_hyde_on': 27,
 'sample_2_hyde_off': 26,
 'sample_3_hyde_on': 31,
 'sample_3_hyde_off': 28,
 'sample_4_hyde_on': 26,
 'sample_4_hyde_off': 28}
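The per-run counts above are easier to compare once aggregated. The sketch below copies the `source_scores` values from the output above and computes the mean count for each setting, plus the number of runs in which HyDE scored higher. With only five runs this is a rough summary rather than a significance test.

```python
from statistics import mean

# Values copied from the source_scores output above.
source_scores = {
    'sample_0_hyde_on': 38, 'sample_0_hyde_off': 29,
    'sample_1_hyde_on': 30, 'sample_1_hyde_off': 28,
    'sample_2_hyde_on': 27, 'sample_2_hyde_off': 26,
    'sample_3_hyde_on': 31, 'sample_3_hyde_off': 28,
    'sample_4_hyde_on': 26, 'sample_4_hyde_off': 28,
}

# Split the scores by setting; dict order keeps the runs aligned.
hyde_on = [v for k, v in source_scores.items() if k.endswith('_hyde_on')]
hyde_off = [v for k, v in source_scores.items() if k.endswith('_hyde_off')]

# Count the runs where HyDE found more matching references than plain RAG.
hyde_wins = sum(on > off for on, off in zip(hyde_on, hyde_off))

print(f"Mean with HyDE: {mean(hyde_on):.1f} / 50")
print(f"Mean without HyDE: {mean(hyde_off):.1f} / 50")
print(f"Runs where HyDE scored higher: {hyde_wins} of {len(hyde_on)}")
```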

In [42]:
print(f"Correct references with HyDE: {score_hyde_on} / {sample_size}")
print(f"Correct references without HyDE: {score_hyde_off} / {sample_size}")

# hyde better sample 2 and 3 is really good. 3: 25/50 with, 21 without
# sample 999, 29 without, 28 with hyde
# New prompt, 27 with, 29 without. 28 with, 27 without

Correct references with HyDE: 26 / 50
Correct references without HyDE: 28 / 50


In [None]:
# random_state = 999
# {'sample_0_hyde_on': 28,
#  'sample_0_hyde_off': 27,
#  'sample_1_hyde_on': 27,
#  'sample_1_hyde_off': 27,
#  'sample_2_hyde_on': 27,
#  'sample_2_hyde_off': 26,
#  'sample_3_hyde_on': 26,
#  'sample_3_hyde_off': 27,
#  'sample_4_hyde_on': 26,
#  'sample_4_hyde_off': 27}

# random_state = 1000
# {'sample_0_hyde_on': 38,
#  'sample_0_hyde_off': 29,
#  'sample_1_hyde_on': 30,
#  'sample_1_hyde_off': 28,
#  'sample_2_hyde_on': 27,
#  'sample_2_hyde_off': 26,
#  'sample_3_hyde_on': 31,
#  'sample_3_hyde_off': 28,
#  'sample_4_hyde_on': 26,
#  'sample_4_hyde_off': 28}