# **Advanced RAG Technique and Evaluation**

This notebook compares a standard RAG pipeline with a compressor-based RAG pipeline.
We use GPT-3.5 Turbo as the LLM. In the compressor-based variant, the retriever is wrapped in a contextual compressor, which keeps only the most relevant information from the documents returned by the similarity search.

Requirements: Before running this notebook, execute LangChainRAG/Embedding-OpenAI-Chroma.ipynb to embed our medical documents. This notebook only applies GPT-3.5 Turbo as the LLM on top of those embeddings.

In [49]:
# BLEU is provided by the evaluate package; there is no "bleu_score" distribution on PyPI.
# langchain-openai is needed for the langchain_openai import used below.
!pip -q install langchain langchain-openai openai chromadb sentence_transformers evaluate rouge_score bert_score


In [1]:
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA

## **OpenAI Authentication**
We use OpenAI's GPT-3.5 Turbo. Make sure your OpenAI account has a positive balance, and create a personal secret key at https://platform.openai.com/api-keys.

In [2]:
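import os
from getpass import getpass
# Never hard-code the secret key in the notebook; take it from the environment or prompt for it.
os.environ["OPENAI_API_KEY"] = os.environ.get("OPENAI_API_KEY") or getpass("OpenAI API key: ")
openai_api_key = os.environ["OPENAI_API_KEY"]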

## **Load Chroma and GPT3.5 Turbo LLM**
We first load the Chroma vector database.

In [3]:
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings,GPT4AllEmbeddings,HuggingFaceBgeEmbeddings

In [4]:
import os

persist_directory = "E:\\NLPT\\_Q-A-INLPT-WS2023\\chroma_openai-003\\chroma_openai"
# Check that the Chroma vector store created by the embedding notebook exists
if not os.path.exists(persist_directory):
    print("No Chroma vector store found. Please run LangChainRAG/Embedding-OpenAI-Chroma.ipynb first.")
else:
    print(f"Directory '{persist_directory}' exists, perfect!")

Directory 'E:\NLPT\_Q-A-INLPT-WS2023\chroma_openai-003\chroma_openai' exists, perfect!


In [5]:
db3 = Chroma(persist_directory=persist_directory, embedding_function=OpenAIEmbeddings())



In [6]:
from langchain import hub
from langchain_openai import ChatOpenAI

from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough


retriever = db3.as_retriever() # print(dir(db3)) to get all functions, attributes
prompt = hub.pull("rlm/rag-prompt")
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)


In [7]:
rag_chain.invoke("Why is the compatibility of drugs with excipients important in pharmaceutical formulations? And how can machine learning aid exactly?")

'The compatibility of drugs with excipients is crucial in pharmaceutical formulations to ensure stability, efficacy, and safety of the medication. Machine learning can aid in drug formulation development by predicting formulations, optimizing properties, and accelerating the discovery process through data analysis and pattern recognition. Deep learning, specifically convolutional neural networks, can excel in image analysis for biomarker identification and optimizing drug formulation.'

## **Load Pubmed QA Dataset**
We now use the pubmed_qa dataset from Hugging Face, which consists of many question-answer pairs. Each question comes with a detailed long answer. The idea is that the answers generated by our RAG pipelines should be similar to the reference answers from the dataset. For our purposes, we extract only a random subset of 100 pairs.

In [8]:
from datasets import load_dataset,DatasetDict
dataset = load_dataset("pubmed_qa", "pqa_artificial")
num_test_samples = 100  # Choose the number of samples for the test set
# Assuming `dataset` is a DatasetDict and 'train' is the key for the training set
training_set = dataset['train']

# Create a test set by taking a random subset of samples from the training set
test_set = training_set.shuffle(seed=42).select(range(num_test_samples))
# Remove the selected samples from the training set to avoid overlap
selected_pubids = set(test_set['pubid'])  # a set gives fast membership tests
training_set = training_set.filter(lambda x: x['pubid'] not in selected_pubids)

new_dataset_dict = DatasetDict({
    'train': training_set,
    'test': test_set,
})

dataset = new_dataset_dict
print(dataset)

questions=dataset["test"]["question"]
answers=dataset["test"]["long_answer"]

DatasetDict({
    train: Dataset({
        features: ['pubid', 'question', 'context', 'long_answer', 'final_decision'],
        num_rows: 211169
    })
    test: Dataset({
        features: ['pubid', 'question', 'context', 'long_answer', 'final_decision'],
        num_rows: 100
    })
})


## **Generate answers using our original RAG**
We now use the standard RAG chain (plain retriever plus GPT-3.5 Turbo) to generate answers to the test questions.

In [9]:
from tqdm import tqdm

simple_answers = []
# Assuming 'questions' is your corpus of questions
for question in tqdm(questions, desc="Processing questions"):
    answer = rag_chain.invoke(question)
    simple_answers.append(answer)

Processing questions: 100%|██████████| 100/100 [02:38<00:00,  1.58s/it]


## **Generate answers using contextual compression**
Here we use LLMChainExtractor to keep only the information relevant to the question from each retrieved document. We set up the compressor-based retriever and generate answers in the same way as above.

In [10]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor,LLMChainFilter
from langchain.llms import OpenAI

compressor = LLMChainExtractor.from_llm(
    llm=llm
)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)
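
To make the idea concrete, here is a toy sketch of contextual compression. It is not the LangChain implementation: a simple word-overlap filter stands in for the LLM call that LLMChainExtractor makes, and the names `tokenize`, `compress_document`, `compress_documents` as well as the example texts are hypothetical.

```python
import re

def tokenize(text):
    """Lower-case a string and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def compress_document(query, document, min_overlap=2):
    """Keep only the sentences that share at least `min_overlap` words with the query.

    Assumption: this heuristic replaces the LLM-based extraction step.
    """
    query_tokens = tokenize(query)
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    kept = [s for s in sentences if len(tokenize(s) & query_tokens) >= min_overlap]
    return " ".join(kept)

def compress_documents(query, documents, min_overlap=2):
    """Compress every document and drop those with nothing relevant left."""
    compressed = (compress_document(query, d, min_overlap) for d in documents)
    return [c for c in compressed if c]

# Hypothetical query and retrieved documents, for illustration only
query = "How does telbivudine affect hepatitis B transmission?"
docs = [
    "Telbivudine reduced perinatal transmission of hepatitis B. The study was funded by a grant.",
    "Machine learning can predict drug formulations.",
]
result = compress_documents(query, docs)
print(result)
```

The second sentence of the first document and the whole second document are dropped, so the prompt receives a shorter, more focused context. The real extractor makes this decision with an LLM call per retrieved document, which is more flexible but also slower.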

In [11]:
print("Question at index 10:", questions[10])
# Note: the documents below are retrieved for questions[2], not for the question printed above
compressed_docs = compression_retriever.get_relevant_documents(questions[2])
compressed_docs

Question at index 10: Does integrative analysis of methylome and transcriptome reveal the importance of unmethylated CpGs in non-CpG island gene activation?




[Document(page_content='telbivudine plus pegylated interferon alfa-2a in a randomized study in chronic hepatitis B', metadata={'keywords': '', 'seq_num': 76446, 'source/title': 'efficacy and safety of telbivudine treatment for the prevention of hbv perinatal transmission.'})]

In [12]:
rag_chain_compressor = (
    {"context": compression_retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)


In [13]:
compressor_answers = []
# Assuming 'questions' is your corpus of questions
for question in tqdm(questions, desc="Processing questions"):
    answer = rag_chain_compressor.invoke(question)
    compressor_answers.append(answer)

Processing questions: 100%|██████████| 100/100 [06:50<00:00,  4.11s/it]


## **Evaluation**
We now use several metrics (BLEU, ROUGE and BERTScore) to compare the answers of the standard RAG and the compressor-based RAG against the reference answers.

In [16]:
import evaluate


bleu = evaluate.load('bleu')
rouge = evaluate.load('rouge')
bertscore = evaluate.load("bertscore")

# Compute BLEU and ROUGE for the standard RAG answers (reuse the metrics loaded above)
bleu_simple = bleu.compute(predictions=simple_answers, references=answers)
rouge_simple = rouge.compute(predictions=simple_answers, references=answers)

# Compute BLEU and ROUGE for the compressor-based RAG answers
bleu_compressor = bleu.compute(predictions=compressor_answers, references=answers)
rouge_compressor = rouge.compute(predictions=compressor_answers, references=answers)


print("BLEU Score Simple:", bleu_simple)
print("BLEU Score Compressor:", bleu_compressor)
print("_________________________")
print("ROUGE Score Simple:", rouge_simple)
print("ROUGE Score Compressor:", rouge_compressor)

BLEU Score Simple: {'bleu': 0.014219017220274232, 'precisions': [0.23206018518518517, 0.05036855036855037, 0.0274869109947644, 0.014705882352941176], 'brevity_penalty': 0.3049827687110593, 'length_ratio': 0.45714285714285713, 'translation_length': 1728, 'reference_length': 3780}
BLEU Score Compressor: {'bleu': 0.06899455448795624, 'precisions': [0.2684931506849315, 0.07918263090676884, 0.04351245085190039, 0.02449528936742934], 'brevity_penalty': 1.0, 'length_ratio': 1.062169312169312, 'translation_length': 4015, 'reference_length': 3780}
_________________________
ROUGE Score Simple: {'rouge1': 0.07569058544827892, 'rouge2': 0.020179545370722664, 'rougeL': 0.05299057070918502, 'rougeLsum': 0.05366707976861297}
ROUGE Score Compressor: {'rouge1': 0.2449988876301581, 'rouge2': 0.09001675398594082, 'rougeL': 0.1839259986196345, 'rougeLsum': 0.18342894960793732}
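
The BLEU values above can be reproduced from the reported components, which helps to interpret them: BLEU is the brevity penalty multiplied by the geometric mean of the 1- to 4-gram precisions. The following is a minimal sketch with the numbers copied from the output above; the helper names `brevity_penalty` and `bleu_from_components` are ours and not part of `evaluate`.

```python
import math

def brevity_penalty(translation_length, reference_length):
    """Return 1.0 if the predictions are longer than the references, else exp(1 - r/c)."""
    if translation_length > reference_length:
        return 1.0
    return math.exp(1 - reference_length / translation_length)

def bleu_from_components(precisions, translation_length, reference_length):
    """Brevity penalty times the geometric mean of the n-gram precisions."""
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return brevity_penalty(translation_length, reference_length) * math.exp(log_mean)

# Values copied from the evaluation output above
bleu_simple_check = bleu_from_components(
    [0.23206018518518517, 0.05036855036855037, 0.0274869109947644, 0.014705882352941176],
    translation_length=1728, reference_length=3780)
bleu_compressor_check = bleu_from_components(
    [0.2684931506849315, 0.07918263090676884, 0.04351245085190039, 0.02449528936742934],
    translation_length=4015, reference_length=3780)

print(f"Brevity penalty simple: {brevity_penalty(1728, 3780):.4f}")
print(f"BLEU simple: {bleu_simple_check:.4f}, BLEU compressor: {bleu_compressor_check:.4f}")
```

The standard RAG answers are less than half as long as the references, so their BLEU score is cut to roughly 30% by the brevity penalty alone, while the compressor-based answers are not penalized.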


In [17]:
import numpy as np
bertscore_simple = bertscore.compute(predictions=simple_answers, references=answers, lang="en")
bertscore_compressor = bertscore.compute(predictions=compressor_answers, references=answers, lang="en")
# Average the per-sample precision, recall and F1 (skip the non-numeric 'hashcode' entry)
bertscore_simple_averaged = {}
bertscore_compressor_averaged = {}
for key in bertscore_simple.keys():
    if key != 'hashcode':
        bertscore_simple_averaged[key] = np.mean(bertscore_simple[key])
        bertscore_compressor_averaged[key] = np.mean(bertscore_compressor[key])

print("BERT Score Simple:",bertscore_simple_averaged)
print("BERT Score Compressor:",bertscore_compressor_averaged)


BERT Score Simple: {'precision': 0.8311528718471527, 'recall': 0.8213963240385056, 'f1': 0.8260968506336213}
BERT Score Compressor: {'precision': 0.8677619963884353, 'recall': 0.8683865022659302, 'f1': 0.8679185700416565}
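
The reported scores can be put side by side with a few lines of Python. The sketch below hard-codes the values from the outputs above, rounded to four decimals, together with the per-question times shown by the two progress bars; the variable names are ours.

```python
# (standard RAG, compressor-based RAG), copied from the outputs above and rounded
scores = {
    "BLEU": (0.0142, 0.0690),
    "ROUGE-1": (0.0757, 0.2450),
    "ROUGE-2": (0.0202, 0.0900),
    "ROUGE-L": (0.0530, 0.1839),
    "BERTScore F1": (0.8261, 0.8679),
}
# Seconds per question reported by tqdm for the two runs
seconds_per_question = (1.58, 4.11)

# Absolute gain of the compressor-based RAG for each metric
gains = {name: round(comp - simple, 4) for name, (simple, comp) in scores.items()}
slowdown = seconds_per_question[1] / seconds_per_question[0]

for name, (simple, comp) in scores.items():
    print(f"{name:<13} {simple:.4f} -> {comp:.4f}  (+{gains[name]:.4f})")
print(f"Time per question: {slowdown:.1f}x that of the standard RAG")
```

Every metric improves, and the compressor-based chain takes about 2.6 times as long per question.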


## **Conclusion:**

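On this sample of 100 questions, contextual compression improves every metric: BLEU rises from 0.014 to 0.069, ROUGE-1 from 0.076 to 0.245, ROUGE-L from 0.053 to 0.184, and the BERTScore F1 from 0.826 to 0.868. Part of the BLEU gain is due to answer length: the standard RAG answers are much shorter than the references (brevity penalty 0.30), whereas the compressor-based answers are not penalized (brevity penalty 1.0). The improvement comes at a cost, since each question takes about 4.1 s instead of 1.6 s because of the additional LLM calls made by the compressor.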