<a target="_parent" href="https://colab.research.google.com/github/gretelai/gretel-blueprints/blob/main/docs/notebooks/demo/navigator/qa-generation/product-question-answer-generator.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# 🎨 Data Designer: Product Information Dataset Generator with Q&A

This notebook demonstrates how to use Gretel's Data Designer to create a synthetic dataset of product information with corresponding questions and answers. This dataset can be used for training and evaluating Q&A systems focused on product information.

The generator creates:
- Product details (name, features, description, price)
- User questions about the products
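- AI-generated answers, about half of which are produced without access to the product details (deliberate hallucinations)
- LLM-judge scores for the completeness and accuracy of each answer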

## Setup

This section installs the required packages and initializes the Gretel client.

## Installing Required Packages

Install the latest Gretel Python client from GitHub.

In [1]:
%%capture
!pip install -U git+https://github.com/gretelai/gretel-python-client@main

In [None]:
from gretel_client.navigator_client import Gretel

gretel = Gretel(
    api_key="prompt",  # Will prompt for your API key
    endpoint="https://api.dev.gretel.ai"  # Gretel API endpoint
)

# Initialize Data Designer with the Apache-2.0 model suite
aidd = gretel.data_designer.new(model_suite="apache-2.0")

## Defining Data Structures

Now we'll define the data models and evaluation rubrics for our product information dataset.

In [3]:
import string
from gretel_client.data_designer.params import Rubric
from pydantic import BaseModel, Field

# Define product information structure
class ProductInfo(BaseModel):
    product_name: str = Field(..., description="A realistic product name for the market.")
    key_features: list[str] = Field(..., min_length=1, max_length=3, description="Key product features.")
    description: str = Field(..., description="A short, engaging description of what the product does, highlighting a unique but believable feature.")
    price_usd: float = Field(..., description="The stated price in USD.")


# Define evaluation rubrics for answer quality
CompletenessRubric = Rubric(
    name="Completeness",
    description="Evaluation of AI assistant's thoroughness in addressing all aspects of the user's query.",
    scoring={
        "Complete": "The response thoroughly covers all key points requested in the question, providing sufficient detail to satisfy the user's information needs.",
        "PartiallyComplete": "The response addresses the core question but omits certain important details or fails to elaborate on relevant aspects that were requested.",
        "Incomplete": "The response significantly lacks necessary information, missing major components of what was asked and leaving the query largely unanswered.",
    }
)

AccuracyRubric = Rubric(
    name="Accuracy",
    description="Evaluation of how factually correct the AI assistant's response is relative to the product information.",
    scoring={
        "Accurate": "The information provided aligns perfectly with the product specifications without introducing any misleading or incorrect details.",
        "PartiallyAccurate": "While some information is correctly stated, the response contains minor factual errors or potentially misleading statements about the product.",
        "Inaccurate": "The response presents significantly wrong information about the product, with claims that contradict the actual product details.",
    }
)
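
To make the constraints in `ProductInfo` concrete, here is a minimal, dependency-free sketch of the same checks. The notebook itself relies on pydantic for validation; `check_product_info` is a hypothetical helper written only for illustration, and it covers just the type and length rules declared above.

```python
def check_product_info(record: dict) -> list[str]:
    """Return a list of constraint violations; an empty list means the record passes."""
    problems = []

    # product_name and description are plain strings in ProductInfo
    for key in ("product_name", "description"):
        if not isinstance(record.get(key), str):
            problems.append(f"{key} must be a string")

    # key_features is a list of 1 to 3 strings (min_length=1, max_length=3)
    features = record.get("key_features")
    if not (isinstance(features, list) and all(isinstance(f, str) for f in features)):
        problems.append("key_features must be a list of strings")
    elif not 1 <= len(features) <= 3:
        problems.append("key_features must contain 1 to 3 items")

    # price_usd is a number (bool is excluded because it is a subclass of int)
    price = record.get("price_usd")
    if isinstance(price, bool) or not isinstance(price, (int, float)):
        problems.append("price_usd must be a number")

    return problems


# Illustrative record, not real generator output
example = {
    "product_name": "Quartz Trail Lantern",
    "key_features": ["Rechargeable", "Water resistant"],
    "description": "A compact lantern for camping trips.",
    "price_usd": 49.99,
}
print(check_product_info(example))
```

A record with four entries in `key_features` would come back with a single violation message, which mirrors what the `max_length=3` constraint is meant to prevent.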

## Data Generation Workflow

Now we'll configure the data generation workflow to create product information, questions, and answers.

In [None]:
# Define product category options
aidd.add_column(
    name="category",
    type="category",
    params={"values": ['Electronics', 'Clothing', 'Home Appliances', 'Groceries', 'Toiletries', 
                       'Sports Equipment', 'Toys', 'Books', 'Pet Supplies', 'Tools & Home Improvement', 
                       'Beauty', 'Health & Wellness', 'Outdoor Gear', 'Automotive', 'Jewelry', 
                       'Watches', 'Office Supplies', 'Gifts', 'Arts & Crafts', 'Baby & Kids', 
                       'Music', 'Video Games', 'Movies', 'Software', 'Tech Devices']}
)

# Define price range to seed realistic product types
aidd.add_column(
    name="price_tens_of_dollars",
    type="uniform",
    params={"low": 1, "high": 200},
    convert_to="int"
)

aidd.add_column(
    name="product_price",
    type="expression",
    # Parenthesize the subtraction: Jinja filters bind tighter than arithmetic operators
    expr="{{ ((price_tens_of_dollars * 10) - 0.01) | round(2) }}",
    dtype="float"
)

# Generate first letter for product name to ensure diversity
aidd.add_column(
    name="first_letter",
    type="category",
    params={"values": list(string.ascii_uppercase)}
)

# Generate product information
aidd.add_column(
    name="product_info",
    type="llm-structured",
    prompt="""\
Generate a realistic product description for a product in the {{ category }} category that costs {{ product_price }}.
The name of the product MUST start with the letter {{ first_letter }}.\
""",
    output_format=ProductInfo
)

# Generate user questions about the product
aidd.add_column(
    name="question",
    prompt="Ask a question about the following product:\n\n {{ product_info }}",
)

# Determine if this example will include hallucination
aidd.add_column(
    name="is_hallucination",
    type="bernoulli",
    params={"p": 0.5}
)

# Generate answers to the questions
aidd.add_column(
    name="answer",
    prompt="""\
{%- if is_hallucination == 0 -%}
<product_info>
{{ product_info }}
</product_info>

{%- endif -%}
User Question: {{ question }}

Directly and succinctly answer the user's question.\
{%- if is_hallucination == 1 -%}
 Make up whatever information you need to in order to answer the user's request.\
{%- endif -%}
"""
)

# Evaluate answer quality
aidd.add_column(
    name="llm_answer_metrics",
    type="llm-judge",
    prompt="""\
<product_info>
{{ product_info }}
</product_info>

User Question: {{ question }}
AI Assistant Answer: {{ answer }}

Judge the AI assistant's response to the user's question about the product described in <product_info>.\
""",
    rubrics=[CompletenessRubric, AccuracyRubric]
)

# Extract metric scores for easier analysis
aidd.add_column(
    name="completeness_result",
    type="expression",
    expr="{{ llm_answer_metrics.Completeness.score }}"
)

aidd.add_column(
    name="accuracy_result",
    type="expression",
    expr="{{ llm_answer_metrics.Accuracy.score }}"
)
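
The seed columns in the cell above (price, first letter, hallucination flag) can be approximated in plain Python to see what values they produce. This is only a sketch: `sample_seed_row` and `price_from_tens` are hypothetical helpers, and it assumes that `convert_to="int"` truncates the uniform draw, which may differ from what Data Designer actually does.

```python
import random
import string


def price_from_tens(price_tens_of_dollars: int) -> float:
    """Ten-dollar steps minus one cent, e.g. 5 -> 49.99."""
    return round(price_tens_of_dollars * 10 - 0.01, 2)


def sample_seed_row(rng: random.Random) -> dict:
    # "uniform" sampler on [1, 200]; truncation to int is an assumption
    price_tens = int(rng.uniform(1, 200))
    return {
        "price_tens_of_dollars": price_tens,
        # "expression" column derived from the sampled value
        "product_price": price_from_tens(price_tens),
        # "category" sampler over the 26 uppercase letters
        "first_letter": rng.choice(string.ascii_uppercase),
        # "bernoulli" sampler with p = 0.5, encoded as 0 or 1
        "is_hallucination": int(rng.random() < 0.5),
    }


print(price_from_tens(1), price_from_tens(200))  # 9.99 1999.99
print(sample_seed_row(random.Random(0)))
```

Sampling the price in ten-dollar steps and subtracting one cent keeps every price in the familiar `x9.99` retail pattern, with the possible range running from 9.99 to 1999.99.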

## Generate the Preview

Let's generate a small preview dataset and examine a sample record to understand the generated data.

In [None]:
# Preview the generated data
outs = aidd.preview()

In [None]:
outs.display_sample_record()
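
When reading a sample record, keep in mind that the `answer` prompt takes two forms depending on `is_hallucination`. The plain-Python sketch below mirrors that branching; `build_answer_prompt` is a hypothetical helper, and its whitespace will not match the Jinja rendering exactly.

```python
def build_answer_prompt(product_info: str, question: str, is_hallucination: int) -> str:
    """Approximate the two variants of the answer prompt."""
    parts = []

    # Grounded case: the model sees the product details
    if is_hallucination == 0:
        parts.append(f"<product_info>\n{product_info}\n</product_info>\n")

    parts.append(f"User Question: {question}\n")

    instruction = "Directly and succinctly answer the user's question."
    # Hallucination case: no product details, plus an explicit licence to invent
    if is_hallucination == 1:
        instruction += (
            " Make up whatever information you need to in order to"
            " answer the user's request."
        )
    parts.append(instruction)

    return "\n".join(parts)


# Illustrative inputs, not real generator output
info = '{"product_name": "Quartz Trail Lantern", "price_usd": 49.99}'
question = "How much does the lantern cost?"

print(build_answer_prompt(info, question, is_hallucination=0))
print("---")
print(build_answer_prompt(info, question, is_hallucination=1))
```

In the hallucination variant the model never sees the product details, so its answer can only agree with them by chance. That is the behavior the accuracy rubric is meant to detect.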

## Viewing the Dataset

We can view the entire preview dataset to understand the variety of products, questions, and answers generated.

In [None]:
outs.dataset.df
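
A useful sanity check on the preview is whether the judge's accuracy labels track the `is_hallucination` flag. The sketch below tabulates the two columns with the standard library. The rows are toy values invented for illustration, not generator output, and `crosstab` is a hypothetical helper.

```python
from collections import Counter


def crosstab(rows, row_key="is_hallucination", col_key="accuracy_result"):
    """Count how often each (flag, label) pair occurs."""
    return Counter((row[row_key], row[col_key]) for row in rows)


# Toy rows for illustration only
toy_rows = [
    {"is_hallucination": 0, "accuracy_result": "Accurate"},
    {"is_hallucination": 0, "accuracy_result": "Accurate"},
    {"is_hallucination": 1, "accuracy_result": "Inaccurate"},
    {"is_hallucination": 1, "accuracy_result": "PartiallyAccurate"},
]

counts = crosstab(toy_rows)
for (flag, label), n in sorted(counts.items()):
    print(f"is_hallucination={flag}  accuracy={label}  count={n}")
```

Assuming `outs.dataset.df` is a pandas DataFrame, the same function could be applied to the preview with `crosstab(outs.dataset.df.to_dict("records"))`. If the pipeline works as intended, most hallucinated rows should not be judged `Accurate`.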

## Generating the Full Dataset

Now that we've verified our data model looks good, let's generate a full dataset with 1,000 records.

In [None]:
# Run the job
workflow_run = aidd.create(num_records=1_000, name="product_qa_dataset")

workflow_run.wait_until_done()