<a target="_parent" href="https://colab.research.google.com/github/gretelai/gretel-blueprints/blob/main/docs/notebooks/data-designer/qa-generation/generate-qa-pairs.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# 🎨 Data Designer: Generate Diverse Q&A Pairs

## Overview

This notebook demonstrates a versatile approach for generating high-quality question-answer (Q&A) pairs from any source document (text, markdown, or PDF files). The generated Q&A pairs are valuable for:

- **Instruction Tuning:** Training models with clear, self-contained examples
- **Retrieval-Augmented Generation (RAG):** Enhancing retrieval systems with precise Q&A pairs
- **Search and FAQ Systems:** Powering natural language query systems and documentation
- **Educational Content:** Creating quizzes and learning materials

## What You'll Learn

- How to process and chunk source documents for Q&A generation
- How to configure categorical distributions for controlled diversity
- How to generate high-quality Q&A pairs with structured output
- How to evaluate the quality of generated pairs with rubric-based scoring
- How to analyze and export the complete dataset

## 1. Setup and Installation

First, let's install and import the required packages for document processing, text generation, and data handling.

In [None]:
# Install required libraries
!pip install -qq langchain smart_open git+https://github.com/gretelai/gretel-python-client@main 
!pip install 'unstructured[pdf]'

## 2. Configuration

Define your source document(s). You can replace the example below with your own documents in PDF, markdown, or text format. The number of Q&A pairs is set later, when the full dataset is generated in section 10.

In [1]:
# Configuration
# -------------
# Source document(s) to generate Q&A pairs from.
# Each entry is a URL or local path to a PDF, markdown, or text file.

DOCUMENT_LIST = ["https://gretel-public-website.s3.us-west-2.amazonaws.com/datasets/rag_evals/databricks-state-of-data-ai-report.pdf"]

## 3. Document Processing

The `DocumentProcessor` class loads source documents and splits them into chunks for Q&A generation. We use unstructured.io to parse each document into text and LangChain's `RecursiveCharacterTextSplitter` to split that text into overlapping chunks.

In [2]:
from typing import List, Union
from langchain.text_splitter import RecursiveCharacterTextSplitter
from unstructured.partition.auto import partition
from smart_open import open
import tempfile
import os

class DocumentProcessor:
    """Handles loading and chunking source documents for Q&A generation."""
    
    def __init__(self, chunk_size: int = 4192, chunk_overlap: int = 200):
        """Initialize with configurable chunk size and overlap."""
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            length_function=len,
        )

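    def parse_document(self, uri: str) -> str:
        """Parse a single document from URI into raw text."""
        # smart_open's open() reads local paths and remote URLs alike
        with open(uri, 'rb') as file:
            content = file.read()

        # Close the temporary file before parsing so that partition()
        # can reopen it by name on every platform
        with tempfile.NamedTemporaryFile(delete=False) as temp_file:
            temp_file.write(content)

        try:
            elements = partition(temp_file.name)
        finally:
            # Remove the temporary file even if parsing fails
            os.unlink(temp_file.name)
        return "\n\n".join(str(element) for element in elements)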

    def process_documents(self, uris: Union[str, List[str]]) -> List[str]:
        """Process one or more documents into chunks for Q&A generation."""
        if isinstance(uris, str):
            uris = [uris]

        all_chunks = []
        for uri in uris:
            text = self.parse_document(uri)
            chunks = self.text_splitter.split_text(text)
            all_chunks.extend(chunks)

        return all_chunks
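
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window chunker. It is only an illustration under stated assumptions: `RecursiveCharacterTextSplitter` also tries to split on separators such as paragraph and line breaks, so its chunk boundaries will differ. The function name `sliding_window_chunks` is hypothetical and is not part of the notebook.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last three characters of the previous one, which is what keeps a sentence that straddles a boundary visible in both chunks.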

## 4. Data Models

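Let's define Pydantic models for structured output generation. `QAPair` is the output schema for the generation step in section 7. `EvalMetrics` describes the four quality scores on a 1-5 scale; in this notebook the scoring itself is done by the rubrics in section 8, so `EvalMetrics` is not passed to Data Designer directly.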

In [3]:
from pydantic import BaseModel, Field

class QAPair(BaseModel):
    """Schema for question-answer pairs"""
    context: str = Field(..., description="The context used to make the question and answer.")
    question: str = Field(..., description="A clear and concise question derived from the context.")
    answer: str = Field(..., description="A detailed and accurate answer fully supported by the context.")


class EvalMetrics(BaseModel):
    """Schema for scoring generation quality"""
    clarity: int = Field(
        ...,
        description="How clear and understandable is the question? (1=vague/confusing, 5=perfectly clear and well-structured)",
        ge=1,
        le=5
    )
    factual_accuracy: int = Field(
        ...,
        description="Is the answer fully supported by the context? (1=contains errors/unsupported claims, 5=completely accurate and supported)",
        ge=1,
        le=5
    )
    comprehensiveness: int = Field(
        ...,
        description="Does the answer cover all necessary details? (1=missing crucial information, 5=complete coverage of all relevant points)",
        ge=1,
        le=5
    )
    answer_relevance: int = Field(
        ...,
        description="Does the answer directly address what was asked in the question? (1=misaligned/tangential, 5=perfectly addresses the question)",
        ge=1,
        le=5
    )
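
The `ge=1` and `le=5` arguments are what enforce the 1-5 range. The sketch below shows that behavior on a trimmed-down model; `ScoreExample` is a hypothetical stand-in for `EvalMetrics` with a single field, used only to keep the example short.

```python
from pydantic import BaseModel, Field, ValidationError


class ScoreExample(BaseModel):
    """Hypothetical one-field stand-in for EvalMetrics."""
    clarity: int = Field(..., ge=1, le=5)


# A score inside the allowed range is accepted
valid = ScoreExample(clarity=4)
print(valid.clarity)

# A score outside the range raises a ValidationError
try:
    ScoreExample(clarity=6)
    rejected = False
except ValidationError:
    rejected = True
print("out-of-range score rejected:", rejected)
```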

## 5. Processing Documents and Setting Up Data Designer

Now we'll process our document chunks and set up the Data Designer with our seed dataset.

In [None]:
import pandas as pd
from gretel_client.navigator_client import Gretel

# Process document chunks
processor = DocumentProcessor(chunk_size=4192, chunk_overlap=200)
chunks = processor.process_documents(DOCUMENT_LIST)

# Create a seed DataFrame with the document chunks
seed_df = pd.DataFrame({"context": chunks})

# Initialize Gretel client and Data Designer
# Passing api_key="prompt" asks for the API key interactively
gretel = Gretel(api_key="prompt", endpoint="https://api.dev.gretel.ai")
aidd = gretel.data_designer.new(model_suite="llama-3.x")  # or "apache-2.0" as needed

# Upload the seed dataset with document chunks
# Shuffling with replacement lets the same context chunk seed more than one record
aidd.with_seed_dataset(seed_df, sampling_strategy="shuffle", with_replacement=True)
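
Sampling with replacement matters when you request more records than there are chunks. The sketch below illustrates the general idea with Python's `random` module. It is not how Data Designer implements seeding internally, and the chunk names are placeholders.

```python
import random

seed_chunks = ["chunk A", "chunk B", "chunk C"]
num_records = 8  # more records than available chunks

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# With replacement: every draw picks from the full pool, so chunks can repeat
with_replacement = rng.choices(seed_chunks, k=num_records)
print(with_replacement)

# Without replacement: each chunk can be drawn at most once,
# so asking for more records than chunks raises ValueError
try:
    rng.sample(seed_chunks, k=num_records)
    without_replacement_ok = True
except ValueError:
    without_replacement_ok = False
print("can draw 8 records without replacement:", without_replacement_ok)
```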

## 6. Adding Categorical Columns for Controlled Diversity

Let's add categorical columns to control the diversity of our Q&A pairs. We'll define different levels of difficulty, question styles, and target audiences to create a rich dataset.

In [None]:
# Add categorical columns for controlled diversity
aidd.add_column(
    name="difficulty",
    type="category",
    params={"values": ["easy", "medium", "hard"]}
)

# Add sophistication as a subcategory of difficulty
aidd.add_column(
    name="sophistication",
    type="subcategory",
    params= {
        "category": "difficulty",
        "values": {
            "easy": [
                "basic", 
                "straightforward"
            ],
            "medium": [
                "intermediate", 
                "moderately complex"
            ],
            "hard": [
                "advanced", 
                "sophisticated"
            ]
        }
    }
)

# Add other categorical columns
aidd.add_column(
    name="question_style",
    type="category",
    params={"values": ["factual", "exploratory", "analytical", "comparative"]}
)

aidd.add_column(
    name="target_audience",
    type="category",
    params={"values": ["novice", "intermediate", "expert"]}
)

aidd.add_column(
    name="response_format",
    type="category",
    params={"values": ["short", "detailed", "step-by-step", "list"]}
)
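
A subcategory column depends on its parent: the `sophistication` value is drawn only from the list that belongs to the sampled `difficulty`. The sketch below mimics that two-step draw in plain Python. Uniform sampling and the helper name `sample_record` are assumptions made for illustration; Data Designer's actual sampling behavior is not shown in this notebook.

```python
import random

difficulty_values = ["easy", "medium", "hard"]
sophistication_by_difficulty = {
    "easy": ["basic", "straightforward"],
    "medium": ["intermediate", "moderately complex"],
    "hard": ["advanced", "sophisticated"],
}

rng = random.Random(42)  # fixed seed so the sketch is repeatable


def sample_record() -> dict:
    """Draw a difficulty, then a sophistication value allowed for that difficulty."""
    difficulty = rng.choice(difficulty_values)
    sophistication = rng.choice(sophistication_by_difficulty[difficulty])
    return {"difficulty": difficulty, "sophistication": sophistication}


records = [sample_record() for _ in range(5)]
for record in records:
    print(record)
```

Because the second draw is conditioned on the first, combinations such as "easy" with "advanced" never appear.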

## 7. Adding LLM-Structured Column for Q&A Pair Generation

Now let's set up the core of our data generation: the Q&A pair column that will produce structured question-answer pairs based on our document context and control parameters.

In [None]:
# Add Q&A pair generation column
aidd.add_column(
    name="qa_pair",
    type="llm-structured",
    system_prompt="""You are an expert at generating high-quality, context-supported Q&A pairs.
Your output should be clear, concise, and factually correct.
Ensure that every question is self-contained and every answer is comprehensive and derived solely from the provided context.
""",
    prompt="""\n{{context}}\n\n
Based on the above context, generate a high-quality Q&A pair. The question should be clear and concise, 
and tailored for an audience at the '{{target_audience}}' level. The overall difficulty should be '{{difficulty}}', 
with a corresponding sophistication level (e.g., {{sophistication}}). The question style should be '{{question_style}}', 
and the answer should be provided in a '{{response_format}}' format. 
Ensure that the answer is factually accurate, fully derived from the context, and comprehensive.\n
Put your thoughts within <think>...</think> before providing the JSON.""",
    output_format=QAPair
)
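
The prompt asks the model to reason inside `<think>...</think>` tags before emitting JSON. How Data Designer post-processes that response is not shown in this notebook; the sketch below is one possible way to separate the two parts, using a hand-written response and a hypothetical helper named `split_reasoning_and_json`.

```python
import json
import re


def split_reasoning_and_json(raw_output: str) -> tuple:
    """Return (reasoning_text, parsed_json) from a '<think>...</think>{json}' response."""
    match = re.search(r"<think>(.*?)</think>", raw_output, flags=re.DOTALL)
    reasoning = match.group(1).strip() if match else ""
    # Everything after the closing tag (or the whole string if no tag) should be JSON
    remainder = raw_output[match.end():] if match else raw_output
    return reasoning, json.loads(remainder)


# Hypothetical model response, written by hand for illustration
raw = (
    "<think>The context defines RAG, so ask for the definition.</think>\n"
    '{"context": "RAG combines retrieval with generation.", '
    '"question": "What does RAG combine?", '
    '"answer": "Retrieval with generation."}'
)

reasoning, qa = split_reasoning_and_json(raw)
print(reasoning)
print(qa["question"])
```

The parsed dictionary has the same three keys as the `QAPair` schema, so it could be validated against that model as a next step.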

## 8. Adding Evaluation Metrics with Rubrics

To assess the quality of our generated Q&A pairs, we'll add evaluation metrics using detailed rubrics for scoring.

In [None]:
from gretel_client.data_designer.params import Rubric

# Define evaluation rubrics
clarity_rubric = Rubric(
    name="Clarity",
    description="Evaluates how clear and understandable the response is",
    scoring={
        "5": "Exceptionally clear and easy to understand",
        "4": "Very clear with minor ambiguities",
        "3": "Adequately clear but could be improved",
        "2": "Somewhat unclear or confusing",
        "1": "Very unclear or difficult to understand"
    }
)

factual_accuracy_rubric = Rubric(
    name="Factual Accuracy",
    description="Evaluates the correctness of factual information in the response",
    scoring={
        "5": "Completely accurate with no errors",
        "4": "Mostly accurate with minor errors",
        "3": "Somewhat accurate with some errors",
        "2": "Mostly inaccurate with significant errors",
        "1": "Completely inaccurate or misleading"
    }
)

comprehensiveness_rubric = Rubric(
    name="Comprehensiveness",
    description="Evaluates how thoroughly the response addresses the question",
    scoring={
        "5": "Completely comprehensive, covering all aspects",
        "4": "Very comprehensive with minor omissions",
        "3": "Adequately comprehensive but missing some details",
        "2": "Not very comprehensive, missing important aspects",
        "1": "Severely lacking in coverage of necessary information"
    }
)

answer_relevance_rubric = Rubric(
    name="Answer Relevance",
    description="Evaluates how relevant the response is to the question asked",
    scoring={
        "5": "Perfectly relevant and directly addresses the question",
        "4": "Highly relevant with minor tangential information",
        "3": "Mostly relevant but includes some irrelevant content",
        "2": "Somewhat relevant but misses the main point",
        "1": "Largely irrelevant to the question asked"
    }
)

EVAL_METRICS_PROMPT_TEMPLATE = """\
You are an expert evaluator of question-answer pairs. Analyze the following Q&A pair within the given context and evaluate it objectively.

## CONTEXT
{{context}}

## Q&A PAIR
{{qa_pair}}

Take a deep breath and carefully evaluate each criterion based on the provided rubrics.
"""

# Add evaluation metrics column
aidd.add_column(
    name="eval_metrics",
    type="llm-judge",
    prompt=EVAL_METRICS_PROMPT_TEMPLATE,
    rubrics=[clarity_rubric, factual_accuracy_rubric, comprehensiveness_rubric, answer_relevance_rubric]
)
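
The `{{context}}` and `{{qa_pair}}` placeholders in the template refer to other columns of the same record. The sketch below shows that kind of substitution with a small regex-based renderer. It is an assumption-laden stand-in: the templating engine Data Designer actually uses is not specified here, and `render_template` is a hypothetical helper.

```python
import re


def render_template(template: str, record: dict) -> str:
    """Replace each {{name}} placeholder with the matching value from record."""
    def substitute(match):
        key = match.group(1)
        if key not in record:
            raise KeyError(f"no column named {key!r} in record")
        return str(record[key])

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


template = "## CONTEXT\n{{context}}\n\n## Q&A PAIR\n{{qa_pair}}\n"
# Hypothetical record, written by hand for illustration
record = {
    "context": "RAG combines retrieval with generation.",
    "qa_pair": {"question": "What does RAG combine?", "answer": "Retrieval with generation."},
}
rendered = render_template(template, record)
print(rendered)
```

Raising on an unknown placeholder surfaces typos in column names early instead of sending a half-filled prompt to the judge.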

## 9. Preview Sample Records

Let's generate a preview to see what our data will look like before running the full generation.

In [None]:
# Preview a Sample of Generated Records
preview = aidd.preview()
preview.display_sample_record()

Let's also explore the preview data as a DataFrame to better understand the structure:

In [None]:
# Explore the generated preview as a Pandas DataFrame
preview.dataset.df

## 10. Generate the Full Dataset

Now let's generate our full dataset of Q&A pairs:

In [None]:
# Submit batch job
workflow_run = aidd.create(
    num_records=100,
    name="qa_pair_generation"
)

workflow_run.wait_until_done()
print("\nGenerated dataset shape:", workflow_run.dataset.df.shape)

## 11. Analyze and Export the Dataset

Let's examine the generated Q&A pairs and export them to a JSONL file:

In [None]:
# Inspect the generated dataset
# display() is needed because only a cell's last expression is shown automatically
display(workflow_run.dataset.df.head(10))

# Export the dataset to JSONL format
workflow_run.dataset.df.to_json('qa_pairs.jsonl', orient='records', lines=True)
print("\nDataset exported to qa_pairs.jsonl")
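
A JSONL file written with `orient='records', lines=True` holds one JSON object per line, so it can be read back without pandas. The sketch below writes two hand-made records to a temporary file and filters them. The field names and values are assumptions for illustration, not actual output from the workflow.

```python
import json
import os
import tempfile

# Hypothetical records that mimic one JSON object per line
sample_lines = [
    {"difficulty": "easy", "qa_pair": {"question": "Q1?", "answer": "A1."}},
    {"difficulty": "hard", "qa_pair": {"question": "Q2?", "answer": "A2."}},
]

path = os.path.join(tempfile.mkdtemp(), "qa_pairs_sample.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for row in sample_lines:
        f.write(json.dumps(row) + "\n")

# Read the file back line by line, skipping blank lines
with open(path, encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]

# Keep only the questions tagged as hard
hard_questions = [r["qa_pair"]["question"] for r in records if r["difficulty"] == "hard"]
print(hard_questions)
```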

## 12. Using Your Q&A Pairs Dataset

Now that you've generated a diverse Q&A dataset, here are some ways to use it:

1. **Model Training:** Use these pairs to fine-tune language models for specific domains
2. **Knowledge Base:** Create a searchable Q&A knowledge base for your documentation
3. **Testing:** Evaluate how well your models or search systems handle different types of questions
4. **Content Creation:** Generate FAQs, quizzes, or educational content
5. **Chatbot Development:** Provide a foundation for chatbot responses in your domain

The JSONL file contains structured data with questions, answers, and quality metrics that you can use across various applications.