<a target="_parent" href="https://colab.research.google.com/github/gretelai/gretel-blueprints/blob/kirit-branch/docs/notebooks/data-designer/rag-examples/generate-rag-evaluation-dataset.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# 🎨 Data Designer: Generate Diverse RAG Evaluations

This tutorial demonstrates how to generate comprehensive evaluation datasets for Retrieval-Augmented Generation (RAG) systems, customized to your content and use cases. 

You'll learn how to create diverse question-answer pairs at scale, covering a variety of difficulty levels and reasoning types, including both answerable and unanswerable scenarios.

### What You'll Learn
- How to process and chunk source documents for RAG evaluation

- How to configure categorical distributions for controlled diversity

- How to generate high-quality Q&A pairs with structured output

- How to evaluate the quality of generated pairs with rubric-based scoring

- How to analyze and export the complete dataset

## 1. Setup and Installation

First, we'll install the required packages for document processing, text generation, and data handling.

In [2]:
# Install required libraries
!pip install -qq langchain smart_open git+https://github.com/gretelai/gretel-python-client@main
!pip install 'unstructured[pdf]'

## 2. Configuration

Let's define our source documents. You can replace the document list with your own PDFs, web pages, or other text sources. The number of evaluation pairs to generate is set later, via `num_records`, when the full dataset is created.

In [1]:
# Define the source documents to generate evaluation pairs from
# You can replace this with your own documents
DOCUMENT_LIST = ["https://gretel-public-website.s3.us-west-2.amazonaws.com/datasets/rag_evals/databricks-state-of-data-ai-report.pdf"]

## 3. Document Processing

Now we'll create a `DocumentProcessor` class that handles loading and chunking the source documents.

This class uses LangChain's `RecursiveCharacterTextSplitter` for chunking and the `partition` function from `unstructured` for robust document parsing.

In [2]:
from typing import List, Union
from langchain.text_splitter import RecursiveCharacterTextSplitter
from unstructured.partition.auto import partition
from smart_open import open
import tempfile
import os

class DocumentProcessor:
    """Handles loading and chunking source documents for RAG evaluation."""
    
    def __init__(self, chunk_size: int = 4192, chunk_overlap: int = 200):
        """Initialize with configurable chunk size and overlap."""
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            length_function=len,
        )

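    def parse_document(self, uri: str) -> str:
        """Parse a single document from URI into raw text."""
        # `open` here is smart_open's, so local paths and remote URIs both work.
        with open(uri, 'rb') as file:
            content = file.read()

        # Write to a named temp file and close it so partition() can read it by path.
        with tempfile.NamedTemporaryFile(delete=False) as temp_file:
            temp_file.write(content)

        try:
            elements = partition(temp_file.name)
        finally:
            # Remove the temp file even if parsing fails.
            os.unlink(temp_file.name)
        return "\n\n".join(str(element) for element in elements)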

    def process_documents(self, uris: Union[str, List[str]]) -> List[str]:
        """Process one or more documents into chunks for RAG evaluation."""
        if isinstance(uris, str):
            uris = [uris]

        all_chunks = []
        for uri in uris:
            text = self.parse_document(uri)
            chunks = self.text_splitter.split_text(text)
            all_chunks.extend(chunks)

        return all_chunks
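To build intuition for `chunk_size` and `chunk_overlap`, here is a minimal sliding-window sketch in plain Python. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter also tries to break on separators such as paragraphs and sentences); the helper name `chunk_with_overlap` is hypothetical and only illustrates how consecutive chunks share characters.

```python
from typing import List


def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-size character windows that share an overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
windows = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(windows)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each window repeats the last three characters of the previous one, which is how overlap keeps a sentence that straddles a boundary visible in both chunks.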

## 4. Data Models

Let's define Pydantic models for structured output generation. These schemas will ensure our generated data has consistent structure and validation.

In [3]:
from pydantic import BaseModel, Field

class QAPair(BaseModel):
    question: str = Field(
        ..., description="A specific question related to the domain of the context"
    )
    answer: str = Field(
        ..., description="Either a context-supported answer or explanation of why the question cannot be answered"
    )
    reasoning: str = Field(
        ..., description="A clear and traceable explanation of the reasoning behind the answer"
    )
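To make the target shape concrete, the sketch below checks a raw JSON string against the three `QAPair` fields using only the standard library. The `REQUIRED_FIELDS` tuple and the `parse_qa_pair` helper are hypothetical stand-ins for illustration (in the notebook, Pydantic performs this validation), and the sample record is made up.

```python
import json

# Mirrors the field names of the QAPair model above.
REQUIRED_FIELDS = ("question", "answer", "reasoning")


def parse_qa_pair(raw: str) -> dict:
    """Parse a JSON string and require three non-empty string fields."""
    record = json.loads(raw)
    if not isinstance(record, dict):
        raise ValueError("expected a JSON object")
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"missing or empty field: {field}")
    return {field: record[field] for field in REQUIRED_FIELDS}


# Made-up example of a well-formed record.
raw_output = json.dumps({
    "question": "Which metric does the report use to rank tools?",
    "answer": "The context does not say, so this cannot be answered.",
    "reasoning": "The passage lists tools but never names a ranking metric.",
})
qa = parse_qa_pair(raw_output)
print(sorted(qa))
```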

## 5. Processing Documents and Setting Up Data Designer

Now we'll split our source documents into chunks and set up Data Designer with those chunks as the seed dataset.

In [None]:
import pandas as pd
from gretel_client.navigator_client import Gretel

# Process document chunks
processor = DocumentProcessor(chunk_size=4192, chunk_overlap=200)
chunks = processor.process_documents(DOCUMENT_LIST)

# Create a seed DataFrame with the document chunks
seed_df = pd.DataFrame({"context": chunks})

# Initialize Gretel client and Data Designer
# Passing api_key="prompt" asks for your API key interactively
gretel = Gretel(api_key="prompt")
aidd = gretel.data_designer.new(model_suite="llama-3.x")

# Upload the seed dataset with document chunks
# Sampling with replacement lets the same context chunk seed more than one record
aidd.with_seed_dataset(seed_df, sampling_strategy="shuffle", with_replacement=True)
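One way to picture why replacement matters is to compare the two sampling modes with Python's `random` module. This is only an illustration of the general idea, not Data Designer's sampling code, and the three chunk strings are placeholders.

```python
import random

chunks = ["chunk A", "chunk B", "chunk C"]  # stand-ins for document chunks
rng = random.Random(0)  # seeded only to make the sketch repeatable

# With replacement: we can draw more seeds than there are chunks.
seeds = rng.choices(chunks, k=10)

# Without replacement: at most len(chunks) draws are possible.
unique_seeds = rng.sample(chunks, k=len(chunks))

print(len(seeds), len(unique_seeds))
```

With a short document and a large record count, sampling with replacement is what allows the number of generated records to exceed the number of chunks.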

## 6. Adding Categorical Columns for Controlled Diversity

Now we'll add categorical columns to control the diversity of our RAG evaluation pairs. We'll define:

1. **Difficulty levels**: easy, medium, hard

2. **Reasoning types**: factual recall, inferential reasoning, etc.

3. **Question types**: answerable vs. unanswerable (with weighting)

In [None]:
from gretel_client.data_designer import columns as C
from gretel_client.data_designer import params as P

aidd.add_column(
    C.SamplerColumn(
        name="difficulty",
        type=P.SamplerType.CATEGORY,
        params=P.CategorySamplerParams(
            values=["easy", "medium", "hard"],
            description="The difficulty level of the question"
        )
    )
)

aidd.add_column(
    C.SamplerColumn(
        name="reasoning_type",
        type=P.SamplerType.CATEGORY,
        params=P.CategorySamplerParams(
            values=[
                "factual recall",
                "inferential reasoning",
                "comparative analysis",
                "procedural understanding",
                "cause and effect"
            ],
            description="The type of reasoning required to answer the question"
        )
    )
)

aidd.add_column(
    C.SamplerColumn(
        name="question_type",
        type=P.SamplerType.CATEGORY,
        params=P.CategorySamplerParams(
            values=["answerable", "unanswerable"],
            # 10:1 ratio of answerable to unanswerable questions.
            weights=[10, 1],  
        )
    )
).validate()
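To see what the `[10, 1]` weights imply, the sketch below normalizes them into probabilities and draws a sample with the standard library. It assumes the weights are normalized in the usual way for weighted categorical sampling; it does not call Data Designer.

```python
import random
from collections import Counter

values = ["answerable", "unanswerable"]
weights = [10, 1]

# Normalize the weights into probabilities: 10/11 and 1/11.
total = sum(weights)
probabilities = {value: weight / total for value, weight in zip(values, weights)}
print(probabilities)

# Draw 100 categories with those weights (seeded for repeatability).
rng = random.Random(42)
draws = rng.choices(values, weights=weights, k=100)
print(Counter(draws))
```

Under that assumption, roughly 9 of every 100 records (1/11, about 9%) would be unanswerable, so larger runs are needed if you want many unanswerable cases.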

## 7. Adding LLM-Structured Column for Q&A Pair Generation

Now let's set up the core of our data generation: the Q&A pair column that will produce structured question-answer pairs based on our document context and control parameters.

In [None]:
from gretel_client.data_designer import columns as C

# Add Q&A pair generation column
aidd.add_column(
    C.LLMStructuredColumn(
        name="qa_pair",
        system_prompt=(
            "You are an expert at generating high-quality RAG evaluation pairs. "
            "You are very careful in assessing whether the question can be answered from the provided context. "
        ),
        prompt="""\
{{context}}

Generate a {{difficulty}} {{reasoning_type}} question-answer pair.
The question should be {{question_type}} using the provided context.

For answerable questions:
- Ensure the answer is fully supported by the context

For unanswerable questions:
- Keep the question topically relevant
- Make it clearly beyond the context's scope
""",
        output_format=QAPair
    )
).validate()
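The `{{...}}` placeholders in the prompt are filled in per record from the other columns. As a rough illustration, the regex-based `render` function below is a hypothetical stand-in (it is not Data Designer's template engine) and the row values are made up.

```python
import re


def render(template: str, row: dict) -> str:
    """Replace each {{name}} placeholder with the matching value from row."""
    def substitute(match: "re.Match") -> str:
        return str(row[match.group(1)])

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


template = (
    "Generate a {{difficulty}} {{reasoning_type}} question-answer pair.\n"
    "The question should be {{question_type}} using the provided context."
)
row = {
    "difficulty": "hard",
    "reasoning_type": "cause and effect",
    "question_type": "unanswerable",
}
prompt_text = render(template, row)
print(prompt_text)
```

Rendering a few rows this way is a quick check that the sampled category values read naturally inside the instruction.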

## 8. Adding Evaluation Metrics with Custom Rubrics

To assess the quality of our generated Q&A pairs, we'll add evaluation metrics using detailed rubrics for scoring. 

We use Data Designer's `LLMJudgeColumn` for this, defining a set of custom Rubrics designed for our task.

In [None]:
from gretel_client.data_designer.params import Rubric
from gretel_client.data_designer import columns as C

context_relevance_rubric = Rubric(
    name="Context Relevance",
    description="Evaluates how relevant the answer is to the provided context",
    scoring={
        "5": "Perfect relevance to context with no extraneous information",
        "4": "Highly relevant with minor deviations from context",
        "3": "Moderately relevant but includes some unrelated information",
        "2": "Minimally relevant with significant departure from context",
        "1": "Almost entirely irrelevant to the provided context"
    }
)

answer_precision_rubric = Rubric(
    name="Answer Precision",
    description="Evaluates the accuracy and specificity of the answer",
    scoring={
        "5": "Extremely precise with exact, specific information",
        "4": "Very precise with minor imprecisions",
        "3": "Adequately precise but could be more specific",
        "2": "Imprecise with vague or ambiguous information",
        "1": "Completely imprecise or inaccurate"
    }
)

answer_completeness_rubric = Rubric(
    name="Answer Completeness",
    description="Evaluates how thoroughly the answer addresses all aspects of the question",
    scoring={
        "5": "Fully complete, addressing all aspects of the question",
        "4": "Mostly complete with minor omissions",
        "3": "Adequately complete but missing some details",
        "2": "Substantially incomplete, missing important aspects",
        "1": "Severely incomplete, barely addresses the question"
    }
)

hallucination_avoidance_rubric = Rubric(
    name="Hallucination Avoidance",
    description="Evaluates the absence of made-up or incorrect information",
    scoring={
        "5": "No hallucinations, all information is factual and verifiable",
        "4": "Minimal hallucinations that don't impact the core answer",
        "3": "Some hallucinations that partially affect the answer quality",
        "2": "Significant hallucinations that undermine the answer",
        "1": "Severe hallucinations making the answer entirely unreliable"
    }
)

EVAL_METRICS_PROMPT_TEMPLATE = """\
You are an expert evaluator of question-answer pairs. Analyze the following Q&A pair and evaluate it objectively.

Context the pair was generated from:
{{context}}

For this {{difficulty}} {{reasoning_type}} Q&A pair:
{{qa_pair}}

Take a deep breath and carefully evaluate each criterion based on the provided rubrics, considering the difficulty level and reasoning type indicated.
"""

aidd.add_column(
    C.LLMJudgeColumn(
        name="eval_metrics",
        prompt=EVAL_METRICS_PROMPT_TEMPLATE,
        rubrics=[context_relevance_rubric, answer_precision_rubric, answer_completeness_rubric, hallucination_avoidance_rubric],
    )
).validate()
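Once judge scores exist, a common next step is to keep only records that clear a quality bar. The sketch below assumes each `eval_metrics` value is a mapping from rubric name to an object with a `score` field; that shape is an assumption for illustration, so inspect a preview record to confirm the actual structure. The `passes_quality_bar` helper and the sample records are hypothetical.

```python
def passes_quality_bar(eval_metrics: dict, min_score: int = 4) -> bool:
    """Return True when every rubric score meets the threshold."""
    return all(int(result["score"]) >= min_score for result in eval_metrics.values())


# Made-up records using the ASSUMED eval_metrics shape.
records = [
    {
        "id": 1,
        "eval_metrics": {
            "Context Relevance": {"score": "5"},
            "Hallucination Avoidance": {"score": "4"},
        },
    },
    {
        "id": 2,
        "eval_metrics": {
            "Context Relevance": {"score": "5"},
            "Hallucination Avoidance": {"score": "2"},
        },
    },
]

kept = [record["id"] for record in records if passes_quality_bar(record["eval_metrics"])]
print(kept)  # [1]
```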

## 9. Preview Sample Records

Let's generate a preview to see what our data will look like before running the full generation.

In [None]:
preview = aidd.preview()

In [None]:
# The preview dataset is available as a pandas DataFrame.
preview.dataset.df.head()

In [None]:
# Run this cell multiple times to cycle through the 10 preview records.
preview.display_sample_record()
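Before committing to a full run, it can help to check how the sampled categories are distributed. The sketch below counts values per column over a list of record dictionaries; the `coverage` helper and the sample rows are hypothetical. With the preview, something like `preview.dataset.df.to_dict("records")` would supply the records.

```python
from collections import Counter
from typing import Dict, Iterable, List


def coverage(records: Iterable[dict], columns: List[str]) -> Dict[str, Counter]:
    """Count how often each category value appears in each column."""
    records = list(records)
    return {column: Counter(record[column] for record in records) for column in columns}


# Made-up rows standing in for preview records.
sample_records = [
    {"difficulty": "easy", "question_type": "answerable"},
    {"difficulty": "hard", "question_type": "answerable"},
    {"difficulty": "hard", "question_type": "unanswerable"},
]
counts = coverage(sample_records, ["difficulty", "question_type"])
for column, counter in counts.items():
    print(column, dict(counter))
```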

## 10. Generate the Full Dataset

Now let's generate our full dataset of RAG evaluation pairs, attach an evaluation report, and export the result to a JSONL file for use in evaluating RAG systems.

In [None]:
# Let's add an evaluation report to the dataset
aidd.with_evaluation_report()

# Generate the full dataset
workflow_run = aidd.create(
    num_records=100,
    name="rag_eval_generation"
)

# This will block until the workflow is done.
workflow_run.wait_until_done()

In [None]:
print("\nGenerated dataset shape:", workflow_run.dataset.df.shape)

# Export the dataset to JSONL format.
workflow_run.dataset.df.to_json('rag_evals.jsonl', orient='records', lines=True)
print("\nDataset exported to rag_evals.jsonl")

## 11. Using Your RAG Evaluation Dataset

Now that you've generated a diverse RAG evaluation dataset, here are some ways to use it:

1. **Benchmarking**: Test your RAG system against these evaluation pairs to measure performance

2. **Error Analysis**: Identify patterns in where your RAG system struggles

3. **Optimization**: Use insights to tune retrieval and generation parameters

4. **Regression Testing**: Track performance over time as you improve your system

5. **Model Comparison**: Compare different LLMs, retrievers, or RAG architectures

The JSONL file contains structured data with questions, ground truth answers, and quality metrics that you can use with most evaluation frameworks.
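As a starting point for benchmarking, the sketch below reads JSONL lines, asks a RAG system each question, and scores the response against the ground truth answer with a simple token-overlap F1. It assumes each line stores `qa_pair` as a nested object with `question` and `answer` keys; check your exported file to confirm. The `token_f1` and `benchmark` helpers, the `lookup_rag` stand-in, and the sample lines are all hypothetical.

```python
import json
from collections import Counter
from typing import Callable, Iterable


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def benchmark(lines: Iterable[str], rag_system: Callable[[str], str]) -> float:
    """Return the mean token F1 of rag_system over the JSONL lines."""
    scores = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        qa = json.loads(line)["qa_pair"]  # ASSUMED nested structure
        scores.append(token_f1(rag_system(qa["question"]), qa["answer"]))
    return sum(scores) / len(scores) if scores else 0.0


# Made-up records; in practice, iterate over open("rag_evals.jsonl").
sample_lines = [
    json.dumps({"qa_pair": {"question": "q1", "answer": "adoption grew", "reasoning": "r"}}),
    json.dumps({"qa_pair": {"question": "q2", "answer": "not stated", "reasoning": "r"}}),
]
answers = {"q1": "adoption grew", "q2": "not stated"}


def lookup_rag(question: str) -> str:
    """Stand-in for a real RAG system: returns a canned answer."""
    return answers[question]


mean_f1 = benchmark(sample_lines, lookup_rag)
print(mean_f1)  # 1.0
```

Token overlap is a blunt metric; for unanswerable questions in particular, you may prefer to check whether the system declines to answer rather than compare wording.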