# RAG

This work will look at the implementation of RAG within NHS England. This notebook contains a simple RAG pipeline which can work with both RAG turned on, and RAG turned off (relying only on the models innate "knowledge"). 

## Setup

In [9]:
import glob
import os
import pandas as pd
import random

import toml
from dotenv import load_dotenv

import src.models as models

from tqdm import tqdm

config = toml.load("config.toml")
load_dotenv(".secrets")
os.environ["ANTHROPIC_API_KEY"] = os.getenv("anthropic_key")

if config['DEV_MODE']:
    config['PERSIST_DIRECTORY'] += "/dev"


First we initialise the RAG pipeline - this is an object which links the vector-store, and the LLM, so when you pass a query in it get passed back into the database, and then returns the response.

There are also methods for adding documents to the database.

In [2]:
rag_pipeline = models.RagPipeline(config['EMBEDDING_MODEL'], config['PERSIST_DIRECTORY'])

  return self.fget.__get__(instance, owner)()


need to fill the database if it's empty (this might take 5 mins or so the first time, unless you've got a nice graphics card!)

In [3]:
# Add documents if there are non - if in DEV mode, don't add any more (if it's not empty)
if len(rag_pipeline.vectorstore.get()['documents']) == 0 or (not config['DEV_MODE']):
    rag_pipeline.load_documents()  

## RAG vs Non-RAG

In [16]:
#load cogstack question and answers
cogstack_qa = pd.read_csv('src/model_eval/cogstack_qa_data_process.csv')
random.seed(1234)
sample_qa = cogstack_qa.sample(n = 10)

sample_qa


Unnamed: 0.1,Unnamed: 0,question,answer,reference,short_reference
20561,20561,What does a care and support plan include for ...,"If you are a carer, your support plan will inc...",https://www.nhs.uk/conditions/social-care-and-...,social-care-and-support-guide
5522,5522,What causes club foot?,"In most cases, the cause of club foot is unkno...",https://www.nhs.uk/conditions/club-foot/,club-foot
8360,8360,What are the symptoms of diarrhoea and vomiting?,The symptoms of diarrhoea and vomiting are fre...,https://www.nhs.uk/conditions/diarrhoea-and-vo...,diarrhoea-and-vomiting
1421,1421,What is the difference between anticoagulants ...,"Although used for similar purposes, anticoagul...",https://www.nhs.uk/conditions/anticoagulants/,anticoagulants
17148,17148,What are the possible side effects of dopamine...,Possible side effects of dopamine agonists inc...,https://www.nhs.uk/conditions/parkinsons-disea...,parkinsons-disease
12089,12089,Are home testing and home sampling kits for HI...,Home testing and home sampling kits are availa...,https://www.nhs.uk/conditions/hiv-and-aids/dia...,hiv-and-aids
17432,17432,What is a diabetic foot ulcer?,A diabetic foot ulcer is an open wound or sore...,https://www.nhs.uk/conditions/peripheral-neuro...,peripheral-neuropathy
15283,15283,Can pregnant women have an MRI scan?,Although there is no evidence that MRI scans a...,https://www.nhs.uk/conditions/mri-scan/who-can...,mri-scan
9542,9542,What should I do if I'm feeling sick and it ke...,"If you're frequently feeling sick, see a GP wh...",https://www.nhs.uk/conditions/feeling-sick-nau...,feeling-sick-nausea
9911,9911,How do I get my test result if I did an NHS PC...,You'll usually get a text or email with your r...,https://www.nhs.uk/conditions/coronavirus-covi...,coronavirus-covid-19


In [83]:
llm_responses = []
llm_references = []

for index, row in sample_qa.iterrows():
    #retrieve question answer and references from df
    cogstack_q = row['question']
    cogstack_a = row['answer']
    cogstack_ref = row['short_reference']

    #run question prompt through LLM and append result
    result = rag_pipeline.answer_question(cogstack_q, rag=True)

    #separate by word and extract reference and generated response
    llm_result = result.split()[:-2]
    if not llm_result:
        llm_result = ''
        llm_ref = ''
    else:
        llm_result = ' '.join(llm_result)
        llm_ref = ' '.join(result.split()[-2:])

    #append generated response and corresponding reference
    llm_responses.append(llm_result)
    llm_references.append(llm_ref)

    







[1m> Entering new StuffDocumentsChain chain...[0m


[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3mYou are a helpful assistant that helps people with their questions. You are not a replacement for human judgement, but you can help humansmake more informed decisions. If you are asked a question you cannot answer based on your following instructions, you should say so.Be concise and professional in your responses.

 Given the following extracted parts of a long document and a question, create a final answer with references ("SOURCES"). If you don't know the answer, just say that you don't know. Don't try to make up an answer. ALWAYS return a "SOURCES" part in your answer.

Example 1: "**RAP** is to be the foundation of analyst training. SOURCES: (goldacre_review.txt)"
Example 2: "Open source code is a good idea because:
* it's cheap (goldacre_review.txt)
* it's easy for people to access and use (open_source_guidlines.txt)
* it's easy to share (goldacre

In [84]:
sample_qa['generated_response'] = llm_responses
sample_qa['generated_reference'] = llm_ref

In [85]:
sample_qa

Unnamed: 0.1,Unnamed: 0,question,answer,reference,short_reference,generated_response,generated_reference
20561,20561,What does a care and support plan include for ...,"If you are a carer, your support plan will inc...",https://www.nhs.uk/conditions/social-care-and-...,social-care-and-support-guide,"Unfortunately, the provided documents do not s...","cervical-screening.txt, genetic-and-genomic-te..."
5522,5522,What causes club foot?,"In most cases, the cause of club foot is unkno...",https://www.nhs.uk/conditions/club-foot/,club-foot,"The exact cause of club foot is often unknown,...","cervical-screening.txt, genetic-and-genomic-te..."
8360,8360,What are the symptoms of diarrhoea and vomiting?,The symptoms of diarrhoea and vomiting are fre...,https://www.nhs.uk/conditions/diarrhoea-and-vo...,diarrhoea-and-vomiting,The symptoms of diarrhoea and vomiting include...,"cervical-screening.txt, genetic-and-genomic-te..."
1421,1421,What is the difference between anticoagulants ...,"Although used for similar purposes, anticoagul...",https://www.nhs.uk/conditions/anticoagulants/,anticoagulants,The main difference between anticoagulants and...,"cervical-screening.txt, genetic-and-genomic-te..."
17148,17148,What are the possible side effects of dopamine...,Possible side effects of dopamine agonists inc...,https://www.nhs.uk/conditions/parkinsons-disea...,parkinsons-disease,The sources provided did not include informati...,"cervical-screening.txt, genetic-and-genomic-te..."
12089,12089,Are home testing and home sampling kits for HI...,Home testing and home sampling kits are availa...,https://www.nhs.uk/conditions/hiv-and-aids/dia...,hiv-and-aids,"Based on the information provided, home testin...","cervical-screening.txt, genetic-and-genomic-te..."
17432,17432,What is a diabetic foot ulcer?,A diabetic foot ulcer is an open wound or sore...,https://www.nhs.uk/conditions/peripheral-neuro...,peripheral-neuropathy,A diabetic foot ulcer is a sore that develops ...,"cervical-screening.txt, genetic-and-genomic-te..."
15283,15283,Can pregnant women have an MRI scan?,Although there is no evidence that MRI scans a...,https://www.nhs.uk/conditions/mri-scan/who-can...,mri-scan,"According to the MRI scan document, pregnant w...","cervical-screening.txt, genetic-and-genomic-te..."
9542,9542,What should I do if I'm feeling sick and it ke...,"If you're frequently feeling sick, see a GP wh...",https://www.nhs.uk/conditions/feeling-sick-nau...,feeling-sick-nausea,It looks like the documents mention some commo...,"cervical-screening.txt, genetic-and-genomic-te..."
9911,9911,How do I get my test result if I did an NHS PC...,You'll usually get a text or email with your r...,https://www.nhs.uk/conditions/coronavirus-covi...,coronavirus-covid-19,"SOURCES: (blood-tests.txt, nhs-screening.txt, ...","cervical-screening.txt, genetic-and-genomic-te..."


In [None]:
#add generated response and refernce columns to dataframe


Now we will run with  **RAG** turned on. You'll see it spits out a bunch of stuff, as it was set to be verbose - namely, it gives back the completed prompt it submitted to the LLM, followed by the answer - you can see the chunks of documents it found.

In [81]:
result = rag_pipeline.answer_question(sample_qa.iloc[1]['question'], rag=True)

print(result)



[1m> Entering new StuffDocumentsChain chain...[0m


[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3mYou are a helpful assistant that helps people with their questions. You are not a replacement for human judgement, but you can help humansmake more informed decisions. If you are asked a question you cannot answer based on your following instructions, you should say so.Be concise and professional in your responses.

 Given the following extracted parts of a long document and a question, create a final answer with references ("SOURCES"). If you don't know the answer, just say that you don't know. Don't try to make up an answer. ALWAYS return a "SOURCES" part in your answer.

Example 1: "**RAP** is to be the foundation of analyst training. SOURCES: (goldacre_review.txt)"
Example 2: "Open source code is a good idea because:
* it's cheap (goldacre_review.txt)
* it's easy for people to access and use (open_source_guidlines.txt)
* it's easy to share (goldacre

In [70]:
result.split()[result.split().index('SOURCES:') + 1:len(result.split())]

['(alzheimers-disease.txt,',
 'stroke.txt,',
 'rett-syndrome.txt,',
 'dementia-with-lewy-bodies.txt,',
 'frontotemporal-dementia.txt,',
 'multiple-system-atrophy.txt,',
 'frontotemporal-dementia.txt,',
 'dementia-with-lewy-bodies.txt,',
 'stroke.txt,',
 'vascular-dementia.txt)']

In [50]:
list1 = ['a', 'b', 'c']
list1.index('a')

0

In [54]:
full_response = result.split()

In [60]:
' '.join(full_response[:-2])

'Based on the information provided, the known cause of club foot is that the Achilles tendon (the large tendon at the back of the ankle) is too short. Club foot may also have a genetic link, as it can run in families.'

In [56]:
full_response[len(full_res

TypeError: list indices must be integers or slices, not str

In [21]:
cogstack_answer = sample_qa.iloc[1]['answer']
cogstack_reference = sample_qa.iloc[1]['short_reference']
print(cogstack_answer)
print(cogstack_reference)

In most cases, the cause of club foot is unknown. However, there may be a genetic link, as it can run in families. In rare cases, it may be linked to more serious conditions such as spina bifida.
club-foot
