# Creating Synthetic Test Data for RAG Pipelines with LangChain, OpenAI, and Ragas

- Evaluating a Retrieval-Augmented Generation (RAG) pipeline typically requires a large set of Question-Context-Answer (QCA) samples, which are time-consuming and labor-intensive to create by hand.

- Human-written questions often lack complexity, which limits the depth of the evaluation.

To address this, we use LangChain, OpenAI, and Ragas to generate synthetic question/ground-truth pairs from the provided documents.
This approach can reduce data aggregation time by up to 90% while supporting a more comprehensive evaluation.

Advantages of Synthetic Data Generation:  
   - Reduces time and effort
   - Efficiently creates diverse question-answer pairs
   - Enables testing of retrieval systems across multiple contexts

In [11]:
!pip install -qU pandas langchain-core langchain-community langchain-openai ragas python-dotenv

In [20]:
# Check versions of installed packages
!pip show langchain-community
!pip show langchain-core
!pip show langchain-openai 
!pip show ragas

Name: langchain-community
Version: 0.3.0
Summary: Community contributed LangChain integrations.
Home-page: https://github.com/langchain-ai/langchain
Author: 
Author-email: 
License: MIT
Location: C:\Users\padma\anaconda3\envs\rag_agent\Lib\site-packages
Requires: aiohttp, dataclasses-json, langchain, langchain-core, langsmith, numpy, pydantic-settings, PyYAML, requests, SQLAlchemy, tenacity
Required-by: ragas
Name: langchain-core
Version: 0.3.1
Summary: Building applications with LLMs through composability
Home-page: https://github.com/langchain-ai/langchain
Author: 
Author-email: 
License: MIT
Location: C:\Users\padma\anaconda3\envs\rag_agent\Lib\site-packages
Requires: jsonpatch, langsmith, packaging, pydantic, PyYAML, tenacity, typing-extensions
Required-by: langchain, langchain-community, langchain-openai, langchain-text-splitters, ragas
Name: langchain-openai
Version: 0.2.0
Summary: An integration package connecting OpenAI and LangChain
Home-page: https://github.com/langchain-ai/l

Setup:
- Set up the LLMs
    - OpenAI API key
    - generator model: creates the questions
    - critic model: reviews each question and either accepts it or suggests changes
    - embeddings: convert text to vectors

- Set up the Ragas generator and distribution
    - create the generator
    - define the distribution of questions (50% simple, 40% multi-context, 10% reasoning)
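
To make the distribution concrete, the sketch below shows how the 50/40/10 weights would translate into question counts for a test set of 10. It is a plain proportional split for illustration only: `expected_counts` is a hypothetical helper, the string keys stand in for the Ragas evolution objects, and Ragas may allocate questions differently internally.

```python
def expected_counts(weights, test_size):
    """Split test_size across question types in proportion to their weights."""
    total = sum(weights.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"weights must sum to 1, got {total}")
    # Round each share to the nearest whole question
    return {name: round(share * test_size) for name, share in weights.items()}


# String keys are stand-ins for the simple / multi_context / reasoning evolutions
weights = {"simple": 0.5, "multi_context": 0.4, "reasoning": 0.1}
counts = expected_counts(weights, test_size=10)
print(counts)  # {'simple': 5, 'multi_context': 4, 'reasoning': 1}
```

Checking that the weights sum to 1 before generating avoids spending API calls on a misconfigured run.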

In [14]:
import os
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

# Load environment variables
load_dotenv()

## Define LLM
def set_llm():
    """
    Initializes Large Language Models (LLMs) using the OpenAI API key.

    Returns:
         Generator LLM, Critic LLM, and Embeddings instance.
    """
    api_key = os.getenv("OPENAI_API_KEY")
    
    if api_key is None:
        raise ValueError("OPENAI_API_KEY environment variable is missing")
        
    # Initialize LLMs with the API key
    generator_llm = ChatOpenAI(api_key=api_key, model="gpt-3.5-turbo")
    critic_llm = ChatOpenAI(api_key=api_key, model="gpt-4o-mini")
    embeddings = OpenAIEmbeddings(api_key=api_key)
    
    return generator_llm,critic_llm,embeddings

## Define the generator and distribution for Ragas
def set_raga_generator_distribution(generator_llm, critic_llm, embeddings):
    """
    Defines the generator and distribution for Ragas.

    Args:
    - generator_llm: Generator LLM instance.
    - critic_llm : Critic LLM instance.
    - embeddings : Embeddings instance.

    Returns:
        TestsetGenerator instance and distribution dictionary.
    """
    generator = TestsetGenerator.from_langchain(
        generator_llm,
        critic_llm,
        embeddings
    )

    distributions = {
        simple: 0.5,
        multi_context: 0.4,
        reasoning: 0.1
    }
    
    return generator, distributions
    
generator_llm,critic_llm,embeddings = set_llm()
generator, distributions = set_raga_generator_distribution(generator_llm,critic_llm,embeddings)

In [13]:
# Check that the OpenAI key is valid by embedding a single word
word = "hello"
vector = embeddings.embed_query(word)
vector[0:3]

[-0.025150660425424576, -0.019543994218111038, -0.027953993529081345]

Load the sample document with LangChain and generate synthetic data from it.

The resulting DataFrame includes:
- question
- contexts
- ground_truth
- evolution_type

In [2]:
import pandas as pd
from langchain_community.document_loaders import TextLoader
from ragas.run_config import RunConfig

def generate_synthetic_data(file_path: str, distributions: dict) -> pd.DataFrame:
    """
    Generates synthetic data based on the provided sample text file.

    Args:
    - file_path (str): The path to the text file used for generating synthetic data.
    - distributions (dict): Maps each evolution type to its share of the generated questions.

    Returns:
    - pd.DataFrame: A pandas DataFrame containing the generated synthetic data.
    """

    # Load text data from file
    loader = TextLoader(file_path)
    documents = loader.load()

    # Generate synthetic data from the LangChain documents with Ragas
    # (uses the module-level `generator` created earlier)
    testset = generator.generate_with_langchain_docs(
        documents,
        test_size=10,  # number of samples to generate
        distributions=distributions,
        run_config=RunConfig(max_retries=50, max_wait=100)
    )

    # Convert generated data to pandas DataFrame
    test_df = testset.to_pandas()

    return test_df

file_path = "./refrigerator.txt"
test_df = generate_synthetic_data(file_path, distributions)
test_df.to_csv('ragas_syn_data.csv', index=False)

embedding nodes:   0%|          | 0/2 [00:00<?, ?it/s]

Filename and doc_id are the same for all nodes.


Generating:   0%|          | 0/10 [00:00<?, ?it/s]

max retries exceeded for MultiContextEvolution(generator_llm=LangchainLLMWrapper(run_config=RunConfig(timeout=180, max_retries=50, max_wait=100, max_workers=16, exception_types=<class 'openai.RateLimitError'>, log_tenacity=False, seed=42)), docstore=InMemoryDocumentStore(splitter=<langchain_text_splitters.base.TokenTextSplitter object at 0x000002B680860F80>, nodes=[Node(metadata={'source': './refrigerator.txt'}, page_content='Key Features of a Refrigerator\nRefrigerators are essential appliances in households and industries alike, offering a range of features that improve convenience, efficiency, and preservation of food. Below are the primary reasons why refrigerators are vital:\n\nFood Preservation\n\nProlonging Freshness: Refrigerators keep perishable items such as fruits, vegetables, meat, and dairy products fresh for longer by maintaining a low temperature, slowing bacterial growth.\nPreventing Spoilage: By providing consistent cooling, they reduce the likelihood of food spoilage,
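
The "max retries exceeded" message relates to the `max_retries` and `max_wait` values passed to `RunConfig`. As a rough illustration of what such limits bound, here is a minimal sketch of retrying a call with capped, jittered exponential backoff. This is not Ragas's implementation: `call_with_retries` and `flaky` are hypothetical names, and the assumption is that `max_retries` caps the number of attempts and `max_wait` caps the pause between them.

```python
import random
import time


def call_with_retries(func, max_retries=50, max_wait=100, sleep=time.sleep):
    """Call func(), retrying on any exception with capped exponential backoff."""
    for attempt in range(1, max_retries + 1):
        try:
            return func()
        except Exception as exc:
            if attempt == max_retries:
                raise RuntimeError("max retries exceeded") from exc
            # Wait a random time up to 2**attempt seconds, never above max_wait
            sleep(random.uniform(0, min(max_wait, 2 ** attempt)))


# Hypothetical flaky call: fails twice, then succeeds
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated rate limit")
    return "ok"


waits = []  # record the waits instead of actually sleeping
result = call_with_retries(flaky, max_retries=5, max_wait=100, sleep=waits.append)
print(result, calls["n"], len(waits))  # ok 3 2
```

If every attempt fails, the sketch raises after `max_retries` attempts, which is the situation the log above reports for the multi-context questions.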


In [3]:
test_df

Unnamed: 0,question,contexts,ground_truth,evolution_type,metadata,episode_done
0,How do modern refrigerators contribute to ener...,[Key Features of a Refrigerator\nRefrigerators...,Modern refrigerators contribute to energy effi...,simple,[{'source': './refrigerator.txt'}],True
1,How do refrigerators provide ample storage cap...,[Key Features of a Refrigerator\nRefrigerators...,Refrigerators provide ample storage capacity f...,simple,[{'source': './refrigerator.txt'}],True
2,How do modern refrigerators utilize advanced c...,[Key Features of a Refrigerator\nRefrigerators...,Modern refrigerators utilize advanced cooling ...,simple,[{'source': './refrigerator.txt'}],True
3,How do modern refrigerators utilize advanced c...,[Key Features of a Refrigerator\nRefrigerators...,Modern refrigerators utilize advanced cooling ...,simple,[{'source': './refrigerator.txt'}],True
4,What are some convenience features found in mo...,[Key Features of a Refrigerator\nRefrigerators...,Some convenience features found in modern refr...,simple,[{'source': './refrigerator.txt'}],True
5,How do modern refrigerators improve convenienc...,[Key Features of a Refrigerator\nRefrigerators...,Modern refrigerators improve convenience by of...,reasoning,[{'source': './refrigerator.txt'}],True
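
The rows displayed above can be compared against the requested distribution. The sketch below copies the `evolution_type` values from the six rows shown (the displayed output may be incomplete) and reports each type's share. The absence of `multi_context` rows is consistent with the `MultiContextEvolution` retry failures in the log, though that link is an inference rather than something the output states.

```python
from collections import Counter

# evolution_type values copied from the six rows displayed above
observed = ["simple", "simple", "simple", "simple", "simple", "reasoning"]
# Requested shares, with string keys standing in for the Ragas evolution objects
requested = {"simple": 0.5, "multi_context": 0.4, "reasoning": 0.1}

counts = Counter(observed)
for name, share in requested.items():
    actual = counts[name] / len(observed)
    print(f"{name}: requested {share:.0%}, observed {actual:.0%} ({counts[name]} rows)")
```

On the real DataFrame, `test_df["evolution_type"].value_counts()` gives the same tally directly.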
