<a target="_parent" href="https://colab.research.google.com/github/gretelai/gretel-blueprints/blob/main/docs/notebooks/demo/navigator/qa-generation/data_designer_sdk_qa_pairs.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# 📚 Data Designer SDK: Generate Diverse Q&A Pairs

This notebook demonstrates a general approach for generating Q&A pairs from any source document
(e.g., text, Markdown, or PDF files). The generated Q&A pairs can be used for:
- **Instruction Tuning:** Training models with clear, self-contained examples.
- **Retrieval-Augmented Generation (RAG):** Enhancing retrieval systems with precise and context-supported Q&A pairs.
- **Search and FAQ Systems:** Powering natural language query systems and documentation.

> **Note:** The [Data Designer](https://docs.gretel.ai/create-synthetic-data/gretel-data-designer-beta) functionality demonstrated in this notebook is currently in **Early Preview**. To access these features and run this notebook, please [join the waitlist](https://gretel.ai/navigator/data-designer#waitlist).

# 📘 Getting Started

First, let's install and import the required packages:

In [None]:
%%capture
# Install required libraries (the %%capture cell magic must be the first line of the cell)
!pip install -qq langchain "unstructured[pdf]" smart_open git+https://github.com/gretelai/gretel-python-client

In [None]:
# Configuration
# -------------
# Define your source document(s) and the number of Q&A pairs to generate.
# You can replace this with your own documents in PDF, markdown, or text formats.

DOCUMENT_LIST = ["https://gretel-public-website.s3.us-west-2.amazonaws.com/datasets/rag_evals/databricks-state-of-data-ai-report.pdf"]
NUM_QA_PAIRS = 50

In [None]:
# Document Processing
# ------------------
# The DocumentProcessor class handles loading and chunking source documents for Q&A generation.
# We use langchain's RecursiveCharacterTextSplitter and unstructured.io for robust document parsing.

from typing import List, Union
from langchain.text_splitter import RecursiveCharacterTextSplitter
from unstructured.partition.auto import partition
from smart_open import open
import tempfile
import os

class DocumentProcessor:
    def __init__(self, chunk_size: int = 4192, chunk_overlap: int = 200):
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            length_function=len,
        )

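    def parse_document(self, uri: str) -> str:
        """Parse a single document from URI into raw text."""
        # smart_open's open() reads local paths as well as remote URIs (HTTP, S3, ...).
        with open(uri, 'rb') as file:
            content = file.read()

        # partition() works on a local file, so write the bytes to a temporary file
        # and close it before parsing.
        with tempfile.NamedTemporaryFile(delete=False) as temp_file:
            temp_file.write(content)

        try:
            elements = partition(filename=temp_file.name)
        finally:
            # Remove the temporary file even if parsing fails.
            os.unlink(temp_file.name)
        return "\n\n".join(str(element) for element in elements)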

    def process_documents(self, uris: Union[str, List[str]]) -> List[str]:
        """Process one or more documents into chunks for Q&A generation."""
        if isinstance(uris, str):
            uris = [uris]

        all_chunks = []
        for uri in uris:
            text = self.parse_document(uri)
            chunks = self.text_splitter.split_text(text)
            all_chunks.extend(chunks)

        return all_chunks
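
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal, stdlib-only sketch of fixed-size chunking with overlap. It is a simplified stand-in and not LangChain's algorithm: `RecursiveCharacterTextSplitter` prefers to split on separators such as paragraph and line breaks, so its chunks will generally differ from these. The function name `sliding_window_chunks` is hypothetical and exists only for this illustration.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-size windows where neighbours share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last three characters of the previous one, which is the same idea the overlap setting serves in the processor above: content that straddles a chunk boundary still appears intact in at least one chunk.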

In [None]:
# Data Models
# -----------
# Define Pydantic models for structured output generation:
# 1. QAPair: Schema for question-answer pairs
# 2. EvalMetrics: Schema for scoring generation quality

from pydantic import BaseModel, Field

class QAPair(BaseModel):
    context: str = Field(..., description="The context used to make the question and answer.")
    question: str = Field(..., description="A clear and concise question derived from the context.")
    answer: str = Field(..., description="A detailed and accurate answer fully supported by the context.")


class EvalMetrics(BaseModel):
    clarity: int = Field(
        ...,
        description="How clear and understandable is the question? (1=vague/confusing, 5=perfectly clear and well-structured)",
        ge=1,
        le=5
    )
    factual_accuracy: int = Field(
        ...,
        description="Is the answer fully supported by the context? (1=contains errors/unsupported claims, 5=completely accurate and supported)",
        ge=1,
        le=5
    )
    comprehensiveness: int = Field(
        ...,
        description="Does the answer cover all necessary details? (1=missing crucial information, 5=complete coverage of all relevant points)",
        ge=1,
        le=5
    )
    answer_relevance: int = Field(
        ...,
        description="Does the answer directly address what was asked in the question? (1=misaligned/tangential, 5=perfectly addresses the question)",
        ge=1,
        le=5
    )

In [None]:
# Set Up Data Designer with the SDK
# ---------------------------------
# Start by chunking the source documents into seed rows.

import pandas as pd
from gretel_client.navigator_client import Gretel

# Process document chunks
processor = DocumentProcessor(chunk_size=4192, chunk_overlap=200)
chunks = processor.process_documents(DOCUMENT_LIST)

# Create a seed DataFrame with the document chunks
seed_df = pd.DataFrame({"context": chunks})

In [None]:
# Initialize Gretel client and Data Designer
gretel = Gretel(api_key="prompt", endpoint="https://api.dev.gretel.ai")
aidd = gretel.data_designer.new(model_suite="llama-3.x")  # or "apache-2.0" as needed

In [None]:
# Upload the seed dataset with document chunks
aidd.with_seed_dataset(seed_df, sampling_strategy="shuffle", with_replacement=True)

In [None]:
# Add categorical columns for controlled diversity
aidd.add_column(
    name="difficulty",
    type="category",
    params={"values": ["easy", "medium", "hard"]}
)

# Add sophistication as a subcategory of difficulty
aidd.add_column(
    name="sophistication",
    type="subcategory",
    params={
        "category": "difficulty",
        "values": {
            "easy": [
                "basic", 
                "straightforward"
            ],
            "medium": [
                "intermediate", 
                "moderately complex"
            ],
            "hard": [
                "advanced", 
                "sophisticated"
            ]
        }
    }
)

# Add other categorical columns
aidd.add_column(
    name="question_style",
    type="category",
    params={"values": ["factual", "exploratory", "analytical", "comparative"]}
)

aidd.add_column(
    name="target_audience",
    type="category",
    params={"values": ["novice", "intermediate", "expert"]}
)

aidd.add_column(
    name="response_format",
    type="category",
    params={"values": ["short", "detailed", "step-by-step", "list"]}
)

# Add Q&A pair generation column
aidd.add_column(
    name="qa_pair",
    type="llm-generated",
    system_prompt="""You are an expert at generating high-quality, context-supported Q&A pairs.
Your output should be clear, concise, and factually correct.
Ensure that every question is self-contained and every answer is comprehensive and derived solely from the provided context.
""",
    prompt="""\n{context}\n\n
Based on the above context, generate a high-quality Q&A pair. The question should be clear and concise, 
and tailored for an audience at the '{target_audience}' level. The overall difficulty should be '{difficulty}', 
with a corresponding sophistication level (e.g., {sophistication}). The question style should be '{question_style}', 
and the answer should be provided in a '{response_format}' format. 
Ensure that the answer is factually accurate, fully derived from the context, and comprehensive.\n
Put your thoughts within <think>...</think> before providing the JSON.""",
    data_config={"type": "structured", "params": {"model": QAPair}}
)

# Add evaluation metrics column
aidd.add_column(
    name="eval_metrics",
    type="llm-generated",
    model_alias="judge",  # Use judge model for evaluation
    prompt="""\n{context}\n\n
Evaluate the following Q&A pair against the context above:\n
{qa_pair}\n\n
Rate each criterion on a scale of 1-5:\n
- Clarity\n
- Factual Accuracy\n
- Comprehensiveness\n
- Answer Relevance\n
Put your thoughts within <think>...</think> before providing the JSON.""",
    data_config={"type": "structured", "params": {"model": EvalMetrics}}
)
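
The column definitions above combine sampled categories with a templated prompt. As a rough mental model only, the sketch below draws a `difficulty`, draws a `sophistication` from the values listed under that difficulty, and fills a shortened prompt template. Uniform sampling, the helper `sample_row`, the shortened template, and the use of `str.format` are all assumptions made for illustration; how Data Designer samples values and renders prompts internally is not shown in this notebook.

```python
import random

DIFFICULTY = ["easy", "medium", "hard"]
SOPHISTICATION = {
    "easy": ["basic", "straightforward"],
    "medium": ["intermediate", "moderately complex"],
    "hard": ["advanced", "sophisticated"],
}
QUESTION_STYLE = ["factual", "exploratory", "analytical", "comparative"]

# Shortened, hypothetical version of the qa_pair prompt.
PROMPT_TEMPLATE = (
    "{context}\n\n"
    "Difficulty: '{difficulty}' (e.g., {sophistication}). "
    "Question style: '{question_style}'."
)


def sample_row(context: str, rng: random.Random) -> dict:
    """Draw one combination of column values for a single context chunk."""
    difficulty = rng.choice(DIFFICULTY)
    return {
        "context": context,
        "difficulty": difficulty,
        # The subcategory is drawn only from the values listed under its parent.
        "sophistication": rng.choice(SOPHISTICATION[difficulty]),
        "question_style": rng.choice(QUESTION_STYLE),
    }


rng = random.Random(0)  # fixed seed so repeated runs draw the same values
row = sample_row("Example context chunk.", rng)
prompt = PROMPT_TEMPLATE.format(**row)
print(prompt)
```

Because the subcategory depends on its parent, a row can never pair "easy" with "sophisticated", which keeps the instructions in each prompt internally consistent.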

In [None]:
# Preview a Sample of Generated Records
preview = aidd.preview()
preview.display_sample_record()
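
Both prompts ask the model to reason inside `<think>...</think>` before emitting JSON. If you ever need to post-process such a raw response yourself, the sketch below shows one way to separate the reasoning from the JSON payload. The helper `split_reasoning` and the sample response are hypothetical; this notebook does not show how Data Designer handles the reasoning block internally.

```python
import json
import re
from typing import Tuple

THINK_PATTERN = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(response: str) -> Tuple[str, dict]:
    """Return (reasoning, parsed_json) from a '<think>...</think>{json}' response."""
    match = THINK_PATTERN.search(response)
    reasoning = match.group(1).strip() if match else ""
    # Everything after the closing tag is expected to be the JSON payload.
    payload = response[match.end():] if match else response
    return reasoning, json.loads(payload)


# Hypothetical raw response shaped like the EvalMetrics schema.
raw = (
    "<think>The context states the figure directly.</think>\n"
    '{"clarity": 5, "factual_accuracy": 4, "comprehensiveness": 4, "answer_relevance": 5}'
)
reasoning, scores = split_reasoning(raw)
print(reasoning)
print(scores)
```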

In [None]:
# Explore the generated preview as a Pandas DataFrame
# ---------------------------

preview.dataset

In [None]:
# Submit batch job
workflow_run = aidd.create(
    num_records=NUM_QA_PAIRS,  # defined in the configuration cell
    workflow_run_name="qa_pair_generation",
    wait_for_completion=True
)
print("\nGenerated dataset shape:", workflow_run.dataset.df.shape)

In [None]:
# Inspect first 10 records of the generated dataset
workflow_run.dataset.df.head(10)
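
Once the judge scores are available, a natural next step is to keep only the stronger Q&A pairs. The sketch below filters plain dictionaries by average score and by a per-metric floor. It assumes each record exposes `eval_metrics` as a nested dict keyed by the `EvalMetrics` field names; the actual column layout in the generated dataset may differ (for example, JSON strings), and the thresholds and the helper `keep_high_quality` are hypothetical.

```python
from statistics import mean
from typing import Dict, List

METRIC_NAMES = ["clarity", "factual_accuracy", "comprehensiveness", "answer_relevance"]


def keep_high_quality(records: List[Dict], min_mean: float = 4.0, min_each: int = 3) -> List[Dict]:
    """Keep records whose judge scores meet both the average and per-metric thresholds."""
    kept = []
    for record in records:
        scores = [record["eval_metrics"][name] for name in METRIC_NAMES]
        if mean(scores) >= min_mean and min(scores) >= min_each:
            kept.append(record)
    return kept


# Hypothetical records; real rows would come from the generated dataset.
records = [
    {"question": "Q1", "eval_metrics": {"clarity": 5, "factual_accuracy": 5, "comprehensiveness": 4, "answer_relevance": 5}},
    {"question": "Q2", "eval_metrics": {"clarity": 5, "factual_accuracy": 2, "comprehensiveness": 5, "answer_relevance": 5}},
    {"question": "Q3", "eval_metrics": {"clarity": 3, "factual_accuracy": 4, "comprehensiveness": 3, "answer_relevance": 4}},
]
kept = keep_high_quality(records)
print([r["question"] for r in kept])
```

Here Q2 is dropped because one metric falls below the floor even though its average is high, and Q3 is dropped because its average is too low.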