# WP3 Advanced RAG

## HyDE Example

**HyDE** (Hypothetical Document Embeddings) is an advanced RAG technique that specifically improves the retrieval of relevant documents. 

In basic RAG, semantic search compares a query directly against the documents in the database.  
But the documents being searched are sometimes written in a very different form from the queries asked.

HyDE attempts to improve this by doing a document-to-document semantic comparison.  
An LLM uses the query to produce a hypothetical document in the same form as the documents found in the database.  
This hypothetical document is then used as the starting point for the semantic search.
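
As a minimal sketch of the idea, the example below uses a toy bag-of-words embedding and a stubbed LLM call. Both are hypothetical stand-ins, not the embedding model or LLM used by `models.RagPipeline`, and the two documents are made up so that the effect is visible. The only difference between the two retrieval calls is which text gets embedded.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def fake_llm(question):
    """Hypothetical stub for the LLM call that writes a hypothetical document."""
    return ("Migraine is a condition that causes a severe headache. "
            "Symptoms include throbbing pain and nausea. "
            "Treatment includes painkillers and rest.")


def retrieve(search_text, documents):
    """Return the document most similar to search_text."""
    return max(documents, key=lambda doc: cosine(embed(search_text), embed(doc)))


# Made-up documents, for illustration only
documents = [
    "Migraine: symptoms include a severe throbbing headache and nausea. Treatment includes painkillers.",
    "Why sleep matters: tips for getting a good night of rest and why my routine helps.",
]

question = "Why does my head hurt so badly?"

basic_hit = retrieve(question, documents)            # basic RAG: embed the query itself
hyde_hit = retrieve(fake_llm(question), documents)   # HyDE: embed the hypothetical document

print("Basic retrieval:", basic_hit)
print("HyDE retrieval: ", hyde_hit)
```

In this toy case the question shares only incidental words with the off-topic document, while the hypothetical document shares its vocabulary with the relevant one. Real embedding models are far less sensitive to exact wording, so the gap is usually smaller in practice.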

### Setup

In [None]:
import glob
import os
import random

import pandas as pd
import toml
from dotenv import load_dotenv
from tqdm import tqdm

import src.models as models

config = toml.load("config.toml")
load_dotenv(".secrets")
os.environ["ANTHROPIC_API_KEY"] = os.getenv("anthropic_key")

if config['DEV_MODE']:
    config['PERSIST_DIRECTORY'] += "/dev"


As in rag_demo.ipynb, we first initialise the RAG pipeline.

In [None]:
rag_pipeline = models.RagPipeline(config['EMBEDDING_MODEL'], config['PERSIST_DIRECTORY'])

We also need to fill in the database if it is empty.

In [None]:
# Add documents if there are none. In DEV mode, don't add any more if the database is not empty.
if len(rag_pipeline.vectorstore.get()['documents']) == 0 or (not config['DEV_MODE']):
    rag_pipeline.load_documents()  

We will also load in a random question from cogstack.

In [None]:
# Load processed questions and answers
cogstack_qa = pd.read_csv('src/model_eval/cogstack_qa_data_process.csv')

# Select a random sample question
sample_qa = cogstack_qa.sample(n=1, random_state=999)

question = sample_qa['question'].values[0]
print(question)

### Using HyDE

We will now use HyDE to answer a RAG question. For an example using only basic RAG, please look at rag_demo.ipynb.

We request a response from the LLM with both `rag` and `hyde` turned on.

When generating a hypothetical document, the following prompt is used:

In [None]:
"""Generate a hypothetical NHS conditions page based on the following question.\
Focus on providing a comprehensive overview, including key details about the condition's symptoms, underlying causes,\
and recommended treatment modalities. Keep in mind the target audience of general readers seeking reliable health information.\
The conditions page should be under 1000 characters.
    
QUESTION: """

The response to this prompt is then used to retrieve relevant documents.

In [None]:
result_hyde_on = rag_pipeline.answer_question(question, rag=True, hyde=True)

print(result_hyde_on)

## Evaluating HyDE

We can compare RAG results with and without HyDE. The following code runs the same 50 question-answer pairs through the pipeline 5 times, with and without HyDE, and checks whether the correct references have been used in each response.

The following code takes a while to run, so a summary of previous experiments is shown below:

| Random Sample (`random_state`) | HyDE Scores    | Mean HyDE Score | No HyDE Scores | Mean No HyDE Score |
|--------------------------------|----------------|-----------------|----------------|--------------------|
| 999                            | 28 27 27 26 26 | 26.8            | 27 27 26 27 27 | 26.8               |
| 1000                           | 38 30 27 31 26 | 30.4            | 29 28 26 28 28 | 27.8               |
| 1001                           | 30 32 31 31 32 | 31.2            | 28 32 28 28 27 | 28.6               |

Testing on 3 random samples of 50 questions, we found that HyDE was marginally better at retrieving the correct references: its mean score was higher in two of the three samples and equal in the third.

Score refers to the number of answers, out of 50, containing the correct reference.
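
The means in the table can be reproduced directly from the per-run scores. The snippet below is a small sketch that hard-codes the scores from the table above. The dictionary layout and variable names are illustrative and not part of the project code, and the keys are the values from the first column of the table, assumed here to be the `random_state` used for sampling.

```python
from statistics import mean

# Per-run scores copied from the summary table, keyed by the first column
# (assumed to be the random_state used to draw each sample)
scores = {
    999:  {"hyde": [28, 27, 27, 26, 26], "no_hyde": [27, 27, 26, 27, 27]},
    1000: {"hyde": [38, 30, 27, 31, 26], "no_hyde": [29, 28, 26, 28, 28]},
    1001: {"hyde": [30, 32, 31, 31, 32], "no_hyde": [28, 32, 28, 28, 27]},
}

# Mean score per method for each sample, rounded to one decimal place
mean_scores = {
    seed: {name: round(mean(runs), 1) for name, runs in by_method.items()}
    for seed, by_method in scores.items()
}

for seed, means in mean_scores.items():
    diff = round(means["hyde"] - means["no_hyde"], 1)
    print(f"random_state={seed}: HyDE {means['hyde']}, no HyDE {means['no_hyde']}, difference {diff:+}")
```

With only five runs per sample, differences of this size are suggestive rather than conclusive.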



In [None]:
source_scores = {}

# The non-deterministic nature of LLMs means we want to test HyDE multiple times on the same test set.
for sample in range(5):

    # Sample 50 questions
    sample_size = 50
    sample_questions = cogstack_qa.sample(n = sample_size, random_state = 1001)

    score_hyde_on  = 0
    score_hyde_off = 0

    for row in sample_questions.itertuples():
        
        question = row.question
        
        result_hyde_off = rag_pipeline.answer_question(question, rag=True, hyde=False)
        result_hyde_on  = rag_pipeline.answer_question(question, rag=True, hyde=True)

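        for result, hyde_on in zip([result_hyde_off, result_hyde_on], [False, True]):

            # Check if the LLM quoted one of the references used in the cogstack response.
            # Occasionally outputs do not include any sources, in which case
            # .index() raises a ValueError that we need to handle.
            try:
                tokens = result.split()
                idx = tokens.index('SOURCES:')
            except ValueError as e:
                print(e)
                print(result)
                continue

            # Strip brackets, commas and file extensions from the quoted source names
            sources = []
            for token in tokens[idx + 1:]:
                for unwanted in ('(', ')', '.txt', ','):
                    token = token.replace(unwanted, '')
                if token:  # skip tokens that were only punctuation
                    sources.append(token)

            # Count each answer at most once, so that the score is the number
            # of answers containing a correct reference
            if any(source in row.reference for source in sources):
                if hyde_on:
                    score_hyde_on += 1
                else:
                    score_hyde_off += 1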

    source_scores[f"sample_{sample}_hyde_on"] = score_hyde_on
    source_scores[f"sample_{sample}_hyde_off"] = score_hyde_off

    print(f"\nSample {sample}")
    print(f"Correct references with HyDE: {score_hyde_on} / {sample_size}")
    print(f"Correct references without HyDE: {score_hyde_off} / {sample_size}")
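
The source-matching step in the loop above could also be factored into a small helper, which makes it easier to check in isolation. This is a sketch only: `extract_sources` and `cites_reference` are hypothetical names, and the response format (a `SOURCES:` marker followed by file names) is inferred from the parsing code above rather than from the pipeline's documentation.

```python
def extract_sources(result):
    """Return cleaned source names that follow the 'SOURCES:' marker, or [] if it is absent."""
    tokens = result.split()
    if 'SOURCES:' not in tokens:
        return []
    idx = tokens.index('SOURCES:')
    sources = []
    for token in tokens[idx + 1:]:
        # Strip brackets, commas and the .txt extension, as in the evaluation loop
        for unwanted in ('(', ')', '.txt', ','):
            token = token.replace(unwanted, '')
        if token:  # skip tokens that were only punctuation
            sources.append(token)
    return sources


def cites_reference(result, reference):
    """True if any quoted source appears in the reference string."""
    return any(source in reference for source in extract_sources(result))


# Made-up response and reference, for illustration only
example = "Rest and fluids usually help. SOURCES: (flu.txt), (common-cold.txt)"
print(extract_sources(example))
print(cites_reference(example, "flu"))
```

Returning an empty list when the marker is missing avoids the `ValueError` path entirely, at the cost of no longer printing the offending response.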
