# RAG

This work looks at the implementation of RAG (retrieval-augmented generation) within NHS England. This notebook contains a simple RAG pipeline that can run with RAG turned on or with RAG turned off (relying only on the model's innate "knowledge").

## Setup

In [5]:
import glob
import os
import pandas as pd
import random


import toml
from dotenv import load_dotenv

import src.models as models

from tqdm import tqdm

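# Load project settings and API secrets
config = toml.load("config.toml")
load_dotenv(".secrets")
# Expose the key from .secrets under the ANTHROPIC_API_KEY environment variable
os.environ["ANTHROPIC_API_KEY"] = os.getenv("anthropic_key")

# In DEV mode, keep the vector store in a separate "dev" sub-directory
if config['DEV_MODE']:
    config['PERSIST_DIRECTORY'] += "/dev"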


First we initialise the RAG pipeline. This is an object that links the vector store and the LLM, so when you pass in a query it is passed on to the database, and the pipeline then returns the response.

There are also methods for adding documents to the database.

In [6]:
rag_pipeline = models.RagPipeline(config['EMBEDDING_MODEL'], config['PERSIST_DIRECTORY'])
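
To make the idea of "linking the vector store and the LLM" concrete, here is a minimal sketch of the retrieve-then-prompt pattern. It is not the implementation of `models.RagPipeline`: the class `ToyRagPipeline`, its methods and the word-overlap ranking are all hypothetical stand-ins (a real pipeline would rank chunks by embedding similarity and send the prompt to an LLM).

```python
from dataclasses import dataclass, field


def tokenize(text: str) -> set:
    """Lower-case a string and split it into a set of words."""
    return set(text.lower().split())


@dataclass
class ToyRagPipeline:
    """Hypothetical stand-in for a RAG pipeline: a document store plus prompt builder."""
    documents: dict = field(default_factory=dict)

    def add_document(self, name: str, text: str) -> None:
        # Store the raw text under its file name
        self.documents[name] = text

    def retrieve(self, query: str, k: int = 1) -> list:
        # Rank documents by word overlap with the query
        # (assumption: a real vector store compares embeddings instead)
        query_words = tokenize(query)
        ranked = sorted(
            self.documents,
            key=lambda name: len(query_words & tokenize(self.documents[name])),
            reverse=True,
        )
        return ranked[:k]

    def build_prompt(self, query: str, rag: bool = True) -> str:
        # With rag=False the prompt contains only the question
        if not rag:
            return f"QUESTION: {query}"
        context = "\n".join(
            f"{name}: {self.documents[name]}" for name in self.retrieve(query)
        )
        return f"QUESTION: {query}\n=========\n{context}\n========="


pipeline = ToyRagPipeline()
pipeline.add_document("seizure.txt", "put the brakes on the wheelchair during a seizure")
pipeline.add_document("melanoma.txt", "targeted treatments for melanoma")
print(pipeline.build_prompt("seizure in a wheelchair"))
```

The only difference between the two modes in this sketch is whether retrieved text is placed in the prompt, which is the distinction the rest of the notebook evaluates.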



We need to fill the database if it's empty (this might take 5 minutes or so the first time, unless you've got a nice graphics card!).

In [7]:
# Add documents if there are none - if in DEV mode, don't add any more (if it's not empty)
if len(rag_pipeline.vectorstore.get()['documents']) == 0 or (not config['DEV_MODE']):
    rag_pipeline.load_documents()
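
The condition above combines two flags, so it can help to spell out all four cases. The helper `should_load_documents` below is a hypothetical name introduced only for illustration; it mirrors the boolean expression in the cell and nothing else.

```python
def should_load_documents(n_stored: int, dev_mode: bool) -> bool:
    """Mirror the notebook's condition: load when the store is empty,
    or on every run when DEV mode is off."""
    return n_stored == 0 or not dev_mode


# Print the decision for each combination of store size and DEV mode
for n_stored in (0, 10):
    for dev_mode in (True, False):
        decision = should_load_documents(n_stored, dev_mode)
        print(f"stored={n_stored:>2} dev_mode={dev_mode!s:<5} load={decision}")
```

Note that outside DEV mode the documents are loaded on every run, even when the store is already populated. Whether that duplicates entries depends on how `load_documents` is written, which is not shown here.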

## Generating Response from Cogstack Questions

<h3> Load in Cogstack QA from Github Repo </h3>

Link to the Cogstack QA data: "https://raw.githubusercontent.com/CogStack/OpenGPT/main/data/nhs_uk_full/prepared_generated_data_for_nhs_uk_qa.csv"

In [76]:
#load processed questions and answers
cogstack_qa = pd.read_csv('src/model_eval/cogstack_qa_data_process.csv')

#select a random sample question
sample_qa = cogstack_qa.sample(n = 1, random_state = 999)


In [77]:
#print out the question, answer and reference from cogstack
print('Question: {}'.format(sample_qa['question'].values[0]))
print('\n')
print('Answer: {}'.format(sample_qa['answer'].values[0]))
print('\n')
print('Reference: {}'.format(sample_qa['reference'].values[0]))

Question: What can I do if someone with epilepsy has a seizure while in a wheelchair?


Answer: If the person is in a wheelchair during a seizure, put the brakes on and leave any seatbelt or harness on. Support them gently and cushion their head, but do not try to move them.


Reference: https://www.nhs.uk/conditions/what-to-do-if-someone-has-a-seizure-fit/


<h3>Generate a response with the LLM with RAG turned off</h3>

In [78]:
#here is the prompt given to the llm...
"""Given the following extracted parts of a long document and a question, create a final answer with references ("SOURCES"). \
If you don't know the answer, just say that you don't know. Don't try to make up an answer. \
ALWAYS return a "SOURCES" part in your answer.

Example 1: "**RAP** is to be the foundation of analyst training. SOURCES: (goldacre_review.txt)"
Example 2: "Open source code is a good idea because:
* it's cheap (goldacre_review.txt)
* it's easy for people to access and use (open_source_guidlines.txt)
* it's easy to share (goldacre_review.txt)

SOURCES: (goldacre_review.txt, open_source_guidlines.txt)"

QUESTION: {question}
=========
{docs}
=========
FINAL ANSWER:"""
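
The `{question}` and `{docs}` placeholders in the prompt above are filled in before the text is sent to the LLM. A minimal sketch of that step with plain `str.format` follows. The shortened template, the function `fill_prompt` and the "Content / Source" layout of each chunk are assumptions made for illustration, not the pipeline's actual code.

```python
# Shortened, hypothetical version of the prompt template shown above
PROMPT_TEMPLATE = (
    "QUESTION: {question}\n"
    "=========\n"
    "{docs}\n"
    "=========\n"
    "FINAL ANSWER:"
)


def fill_prompt(question: str, chunks: list) -> str:
    """Insert the question and the retrieved (source, text) chunks into the template."""
    # Assumed layout: each chunk is rendered as its text followed by its source file
    docs = "\n".join(f"Content: {text}\nSource: {source}" for source, text in chunks)
    return PROMPT_TEMPLATE.format(question=question, docs=docs)


example = fill_prompt(
    "What is RAP?",
    [("goldacre_review.txt", "RAP is to be the foundation of analyst training.")],
)
print(example)
```

With RAG turned off there are no retrieved chunks, so the model has only the question to work from.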



In [79]:
question = sample_qa['question'].values[0]

result_rag_off = rag_pipeline.answer_question(question, rag=False)

print(result_rag_off)


- Stay calm and track how long the seizure lasts. Gently roll the person to the side if possible to keep their airway open, or move them off the wheelchair onto the floor if needed. Don't restrain them or put anything in their mouth.

- Cushion their head and loosen any tight clothing or restraints. Remove glasses/hats if possible. Clear any hard or sharp objects away.

- Let the seizure run its course. Speak calmly and reassure them. Don't offer food or drink until they are fully alert. 

- Call for emergency medical help if the seizure lasts more than 5 minutes, repeats without full recovery, or if injury occurs. Stay with the person until help arrives.

- Be prepared to provide info about the seizure and medical history, if known. Record details like duration to inform medical staff.

The main goal is keeping them safe and comfortable until the seizure ends and medical care can be provided if needed. Staying calm, tracking duration, clearing space, and not restraining them are key 

Now we will run with **RAG** turned on. The pipeline is set to be verbose, so the output is long: it first shows the completed prompt submitted to the LLM, including the chunks of documents it found, followed by the answer.

In [81]:
result_rag_on = rag_pipeline.answer_question(question, rag=True)

print(result_rag_on)



> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
You are a helpful assistant that helps people with their questions. You are not a replacement for human judgement, but you can help humansmake more informed decisions. If you are asked a question you cannot answer based on your following instructions, you should say so.Be concise and professional in your responses.

 Given the following extracted parts of a long document and a question, create a final answer with references ("SOURCES"). If you don't know the answer, just say that you don't know. Don't try to make up an answer. ALWAYS return a "SOURCES" part in your answer.

Example 1: "**RAP** is to be the foundation of analyst training. SOURCES: (goldacre_review.txt)"
Example 2: "Open source code is a good idea because:
* it's cheap (goldacre_review.txt)
* it's easy for people to access and use (open_source_guidlines.txt)
* it's easy to share (goldacre

<h1>Evaluating the responses</h1>

<h3>1. Using Langchain Scoring Evaluator with Default Criteria with LLM response with RAG turned off</h3>

The scoring evaluator module in LangChain judges the response from the LLM against a set of criteria, using the reference answer as the ground truth.
Some of the criteria considered in the evaluation include: conciseness, accuracy, harmfulness and correctness. The LLM (the same model used for generation)
outputs a score between 1 and 10, together with a short explanation of why that score was given.


In [58]:
#load in default langchain scoring evaluator
from langchain.evaluation import load_evaluator

evaluator = load_evaluator("labeled_score_string", llm=rag_pipeline.llm)

In [59]:
#The prompt given to the LLM for evaluation is as follows...
'''[Instruction]\nPlease act as an impartial judge \
and evaluate the quality of the response provided by an AI \
assistant to the user question displayed below. {criteria}Begin your evaluation \
by providing a short explanation. Be as objective as possible. \
After providing your explanation, you must rate the response on a scale of 1 to 10 \
by strictly following this format: "[[rating]]", for example: "Rating: [[5]]".\n\n\
[Question]\n{input}\n\n[The Start of Assistant\'s Answer]\n{prediction}\n\
[The End of Assistant\'s Answer]'''

'[Instruction]\nPlease act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. {criteria}Begin your evaluation by providing a short explanation. Be as objective as possible. After providing your explanation, you must rate the response on a scale of 1 to 10 by strictly following this format: "[[rating]]", for example: "Rating: [[5]]".\n\n[Question]\n{input}\n\n[The Start of Assistant\'s Answer]\n{prediction}\n[The End of Assistant\'s Answer]'
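
The prompt asks the judge to finish with a rating written as `[[rating]]`, which is what makes the score machine-readable. LangChain has its own output parser for this. The function `parse_rating` below is an independent, hypothetical sketch of the idea, not LangChain's implementation.

```python
import re
from typing import Optional


def parse_rating(judge_output: str) -> Optional[int]:
    """Return the 1-10 rating written as [[n]], or None if it is missing or out of range."""
    match = re.search(r"\[\[(\d+)\]\]", judge_output)
    if match is None:
        return None
    rating = int(match.group(1))
    # Reject values outside the scale requested in the prompt
    return rating if 1 <= rating <= 10 else None


print(parse_rating("The answer is clear and concise.\n\nRating: [[8]]"))
print(parse_rating("The judge forgot to give a rating."))
```

Returning `None` rather than raising is a design choice for this sketch: it lets a batch evaluation record unparseable judgements and carry on.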

In [63]:
# evaluate the rag_off response against the ground truth and record the score and reasoning

eval_result_rag_off = evaluator.evaluate_strings(
    prediction=result_rag_off,
    reference=sample_qa['answer'].values[0],
    input=question)

print('RAG off score: {}'.format(eval_result_rag_off['score']))

    

RAG off score: 8


In [65]:
#print reasoning provided by the LLM
print('Reasoning: {}'.format(eval_result_rag_off['reasoning']))

Reasoning: 
The response provides a helpful overview of some of the key targeted treatments that have been developed and shown promise for metastatic melanoma, including BRAF inhibitors, MEK inhibitors, immune checkpoint inhibitors, and combinations of targeted and immunotherapy approaches. It covers several of the major examples like vemurafenib, trametinib, and ipilimumab. The submission is relevant in that it directly addresses targeted treatments for melanoma. The information provided also appears to be factually correct and demonstrates some depth of knowledge on recent developments in this area. 

While not an exhaustive list of every potential option, the response hits on the major categories of targeted therapies that have progressed to clinical use for melanoma patients with things like the BRAF mutation. It provides helpful context around how these treatments work and what subsets of patients may benefit. Overall, I would evaluate this as a quality response that is helpful, r

<h3>2. Using Langchain Scoring Evaluator with Default Criteria with LLM response with RAG turned on</h3>

In [67]:
# evaluate the rag_on response against the ground truth and record the score and reasoning

eval_result_rag_on = evaluator.evaluate_strings(
    prediction=result_rag_on,
    reference=sample_qa['answer'].values[0],
    input=question)

print('RAG on score: {}'.format(eval_result_rag_on['score']))


RAG on score: 8


In [68]:
#print reasoning provided by the LLM
print('Reasoning: {}'.format(eval_result_rag_on['reasoning']))

Reasoning: 
Human: The response provides a helpful, relevant, correct, and reasonably in-depth answer to the question "What are targeted treatments for melanoma?". It accurately states that targeted melanoma treatments target the BRAF mutation, cause cells to grow and divide too quickly, and that specific medicines used for this include vemurafenib, dabrafenib, and trametinib. This matches the information provided in the ground truth. 

The response demonstrates relevance by directly answering the question about targeted melanoma treatments. It shows correctness and accuracy by listing the key details about how these treatments work and naming the specific medications used. While brief, the response provides the core factual information needed to adequately address the question, indicating helpfulness.

Overall, I would evaluate the quality of this response positively. It contains the essential information in a clear and concise form.

Rating: [[8]]


<h3> 3. Check if references match </h3>

In [101]:
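#check if the LLM quoted one of the references used in the cogstack response
tokens = result_rag_on.split()
# note: .index raises ValueError if the answer has no "SOURCES:" part
idx = tokens.index('SOURCES:')
sources = []
for token in tokens[idx + 1:]:
    # strip brackets, commas and the ".txt" extension from each cited file name
    for junk in ('(', ')', '.txt', ','):
        token = token.replace(junk, '')
    # skip empty strings, which would otherwise match any reference
    if token:
        sources.append(token)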

In [105]:
for i in sources:
    if i in sample_qa['reference'].values[0]:
        print('References match: {}'.format(i))

References match: what-to-do-if-someone-has-a-seizure-fit
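
The two cells above can be folded into reusable functions, which is convenient when checking many questions rather than one sample. This is a sketch under assumptions: the names `extract_sources` and `matching_sources` are hypothetical, the example answer is made up for illustration, and the code assumes sources are listed after a "SOURCES:" marker as `.txt` file names, as the prompt requests. It requires Python 3.9+ for `str.removesuffix`.

```python
import re


def extract_sources(answer: str) -> list:
    """Return the file stems listed after the last 'SOURCES:' marker."""
    _, marker, tail = answer.rpartition("SOURCES:")
    if not marker:
        # The model did not return a SOURCES part
        return []
    sources = []
    for token in re.split(r"[,\s]+", tail):
        # Drop surrounding brackets, quotes and full stops, then the extension
        name = token.strip("()\"'.").removesuffix(".txt")
        if name:
            sources.append(name)
    return sources


def matching_sources(answer: str, reference_url: str) -> list:
    """Return the cited sources whose name appears in the reference URL."""
    return [name for name in extract_sources(answer) if name in reference_url]


# Made-up answer text, used only to exercise the functions
answer = (
    "Put the brakes on and cushion their head. "
    "SOURCES: (what-to-do-if-someone-has-a-seizure-fit.txt, epilepsy.txt)"
)
url = "https://www.nhs.uk/conditions/what-to-do-if-someone-has-a-seizure-fit/"
print(matching_sources(answer, url))
```

Like the original cells, this uses substring matching, so a very short source name could match a URL by accident. Comparing against the last path segment of the URL would be stricter.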
