Evaluating Retrieval-Augmented Generation (RAG) pipelines is crucial for assessing their performance. However, manually creating hundreds of Question-Context-Answer (QCA) samples from documents is time-consuming and labor-intensive. Human-written questions may also fall short of the complexity needed for a thorough evaluation, which lowers the quality of the assessment. Synthetic data generation can reduce the developer time spent on data aggregation by 90%.

In [None]:
!pip install ragas==0.1.13 langchain-openai==0.1.21 sentence_transformers xmltodict python-dotenv


Collecting ragas==0.1.13
  Downloading ragas-0.1.13-py3-none-any.whl.metadata (5.2 kB)
Collecting langchain-openai==0.1.21
  Downloading langchain_openai-0.1.21-py3-none-any.whl.metadata (2.6 kB)
Collecting langchain-core (from ragas==0.1.13)
  Downloading langchain_core-0.2.43-py3-none-any.whl.metadata (6.2 kB)
Collecting langsmith<0.2.0,>=0.1.112 (from langchain-core->ragas==0.1.13)
  Downloading langsmith-0.1.144-py3-none-any.whl.metadata (14 kB)
INFO: pip is looking at multiple versions of langchain to determine which version is compatible with other requirements. This could take a while.
Collecting langchain (from ragas==0.1.13)
  Downloading langchain-0.3.7-py3-none-any.whl.metadata (7.1 kB)
  Downloading langchain-0.3.6-py3-none-any.whl.metadata (7.1 kB)
  Downloading langchain-0.3.5-py3-none-any.whl.metadata (7.1 kB)
  Downloading langchain-0.3.4-py3-none-any.whl.metadata (7.1 kB)
  Downloading langchain-0.3.3-py3-none-any.whl.metadata (7.1 kB)
  Downloading langchain-0.3.2-py3

In [None]:
import os
from google.colab import userdata
import pandas as pd
from langchain_community.document_loaders import PubMedLoader
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from langchain_openai import ChatOpenAI
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context


In [None]:
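# Read the key from Colab secrets instead of hard-coding it.
# The secret name "OPENAI_API_KEY" is an assumption; use the name you stored it under.
os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")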

In [None]:
data_generation_model = ChatOpenAI(model='gpt-4o-mini')

In [None]:
critic_model = ChatOpenAI(model='gpt-4o')

In [None]:
model_name = "BAAI/bge-small-en"
model_kwargs = {"device": "cpu"}
encode_kwargs = {"normalize_embeddings": True}
embeddings = HuggingFaceBgeEmbeddings(
    model_name=model_name, model_kwargs=model_kwargs, encode_kwargs=encode_kwargs
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [None]:
loader = PubMedLoader("cancer", load_max_docs=5)

In [None]:
loader

<langchain_community.document_loaders.pubmed.PubMedLoader at 0x78b6ec4eaf80>

In [None]:
documents = loader.load()

In [None]:
generator = TestsetGenerator.from_langchain(
    data_generation_model,
    critic_model,
    embeddings
)


In [None]:
documents[0]


Document(metadata={'uid': '39575654', 'Title': 'Cytokine release syndrome caused by immune checkpoint inhibitors: a case report and literature review.', 'Published': '2024-11-22', 'Copyright Information': ''}, page_content="Immune checkpoint inhibitors (ICIs) have gained widespread application in the treatment of malignant tumors. Cytokine release syndrome (CRS) is a systemic inflammatory response triggered by various factors, including infections and immunotherapy. We present a case of CRS occurring in a gastric cancer patient after receiving combination therapy of tislelizumab, anlotinib and combination of capecitabine and oxaliplatin. Nineteen days after the third dose of tislelizumab, the patient experienced sudden unconsciousness, frothing at the mouth, convulsions and other clinical manifestations resembling epileptiform seizures. Elevated inflammatory markers, cytokine levels and ferritin were markedly increased. Given the absence of definite clinical evidence for metastasis and

In [None]:
distributions = {
    simple: 0.5,
    multi_context: 0.4,
    reasoning: 0.1
}
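
The weights above are proportions of the test set, so it helps to check what they imply for a small run. The sketch below is illustrative only: it uses plain strings in place of the ragas evolution objects so that it runs without ragas installed, and it computes expected counts rather than reproducing however ragas rounds them internally (that behavior is not shown in this notebook).

```python
# Illustrative only: string keys stand in for the ragas evolution objects
# (simple, multi_context, reasoning) so the sketch runs without ragas.
import math

distributions = {"simple": 0.5, "multi_context": 0.4, "reasoning": 0.1}
test_size = 5

# The weights are proportions, so they should add up to 1.
total = sum(distributions.values())
assert math.isclose(total, 1.0), f"weights sum to {total}, expected 1.0"

# Expected (fractional) number of questions per evolution type.
expected_counts = {
    name: weight * test_size for name, weight in distributions.items()
}

for name, count in expected_counts.items():
    print(f"{name}: {count:.1f} of {test_size}")
```

With only five samples, a weight of 0.1 corresponds to half a question in expectation, so a small run may contain no `reasoning` rows at all. A larger test size gives the low-weight types a realistic chance of appearing.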

In [None]:
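# Generate 5 samples from the loaded documents using the evolution weights above.
testset = generator.generate_with_langchain_docs(
    documents, test_size=5, distributions=distributions
)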

embedding nodes:   0%|          | 0/10 [00:00<?, ?it/s]



Generating:   0%|          | 0/5 [00:00<?, ?it/s]

In [None]:
test_df = testset.to_pandas()

In [None]:
print(test_df)

                                            question  \
0  What was the outcome of the combination therap...   
1  What is the significance of conditional indepe...   
2  How does conditional independence affect breas...   
3  How does the BOADICEA BC model assess breast c...   
4  How do post-RAI thyroglobulin levels relate to...   

                                            contexts  \
0  [Immune checkpoint inhibitors (ICIs) have gain...   
1  [We consider estimation of measures of model p...   
2  [We consider estimation of measures of model p...   
3  [The German Consortium for Hereditary Breast a...   
4  [OBJECTIVE: This study aimed to assess the use...   

                                        ground_truth evolution_type  \
0  The outcome of the combination therapy involvi...         simple   
1  The significance of conditional independence i...         simple   
2  The context does not provide specific informat...  multi_context   
3  The BOADICEA BC risk model assesses bre

In [None]:
test_df

Unnamed: 0,question,contexts,ground_truth,evolution_type,metadata,episode_done
0,What was the outcome of the combination therap...,[Immune checkpoint inhibitors (ICIs) have gain...,The outcome of the combination therapy involvi...,simple,"[{'uid': '39575654', 'Title': 'Cytokine releas...",True
1,What is the significance of conditional indepe...,[We consider estimation of measures of model p...,The significance of conditional independence i...,simple,"[{'uid': '39575627', 'Title': 'Sensitivity ana...",True
2,How does conditional independence affect breas...,[We consider estimation of measures of model p...,The context does not provide specific informat...,multi_context,"[{'uid': '39575627', 'Title': 'Sensitivity ana...",True
3,How does the BOADICEA BC model assess breast c...,[The German Consortium for Hereditary Breast a...,The BOADICEA BC risk model assesses breast can...,multi_context,"[{'uid': '39575650', 'Title': 'Calculating fut...",True
4,How do post-RAI thyroglobulin levels relate to...,[OBJECTIVE: This study aimed to assess the use...,The context does not provide information on th...,multi_context,"[{'uid': '39575624', 'Title': 'Predictive valu...",True
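
Some of the generated ground truths above begin with "The context does not provide", which suggests the question could not be answered from the retrieved context. One possible way to flag such rows before using the test set is sketched below. The helper name `is_unanswerable` and the sample rows are hypothetical, and matching on a fixed prefix is a simplifying assumption, not something ragas does.

```python
# Hypothetical helper, not part of ragas: flags ground truths in which the
# model states that the context lacks the needed information.
UNANSWERABLE_PREFIX = "The context does not provide"


def is_unanswerable(ground_truth: str) -> bool:
    """Return True when the ground truth begins with the refusal phrase."""
    return ground_truth.strip().startswith(UNANSWERABLE_PREFIX)


# Placeholder rows for illustration; the text after each prefix is made up.
sample_rows = [
    {
        "evolution_type": "simple",
        "ground_truth": "The outcome of the combination therapy was (placeholder).",
    },
    {
        "evolution_type": "multi_context",
        "ground_truth": "The context does not provide specific information (placeholder).",
    },
]

# Keep only the rows whose ground truth is an actual answer.
usable = [row for row in sample_rows if not is_unanswerable(row["ground_truth"])]
print(f"{len(usable)} of {len(sample_rows)} rows kept")
```

On the DataFrame itself, the same check could be applied with `test_df[~test_df["ground_truth"].apply(is_unanswerable)]`.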


In [None]:
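# index=False keeps the row index out of the file, which avoids an extra
# "Unnamed: 0" column when the CSV is read back.
test_df.to_csv("data.csv", index=False)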
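
One thing to keep in mind when reloading `data.csv`: the `contexts` column holds Python lists, and a CSV file stores each of them as its string representation. The sketch below shows one way to turn such a cell back into a list. The helper `parse_contexts` is hypothetical, and the example uses the standard `csv` module with an in-memory buffer and a placeholder row instead of the real file.

```python
import ast
import csv
import io


def parse_contexts(cell: str) -> list:
    """Turn a stringified list such as "['a', 'b']" back into a Python list."""
    value = ast.literal_eval(cell)
    if not isinstance(value, list):
        raise ValueError(f"expected a list literal, got {type(value).__name__}")
    return value


# In-memory stand-in for data.csv; the row content is a placeholder.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["question", "contexts"])
writer.writerow(["placeholder question", str(["first chunk", "second chunk"])])

# Read the row back: the list comes back as plain text until it is parsed.
buffer.seek(0)
row = next(csv.DictReader(buffer))
print(type(row["contexts"]).__name__)

contexts = parse_contexts(row["contexts"])
print(len(contexts))
```

With pandas, the same helper could be passed at load time, for example `pd.read_csv("data.csv", converters={"contexts": parse_contexts})`.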

Ragas takes a novel approach to evaluation data generation. An ideal evaluation dataset should cover the types of questions encountered in production, including questions of varying difficulty. LLMs by default are not good at creating diverse samples, as they tend to follow common paths. Inspired by works like Evol-Instruct, Ragas addresses this with an evolutionary generation paradigm, in which questions with different characteristics, such as reasoning, conditioning, and multi-context, are systematically crafted from the provided set of documents.