# 🎨 Navigator Data Designer SDK: Product Information Dataset Generator with Q&A

This notebook demonstrates how to use Gretel's Data Designer to create a synthetic dataset of product information with corresponding questions and answers. This dataset can be used for training and evaluating Q&A systems focused on product information.

The generator creates:
- Product details (name, features, description, price)
- User questions about the products
- AI-generated answers
- Evaluation metrics for answer quality

## Setup

This section installs the required packages and initializes the Gretel client.

## Installing Required Packages

Install the Gretel Python client from GitHub.

In [20]:
%%capture
!pip install git+https://github.com/gretelai/gretel-python-client

In [21]:
from gretel_client.navigator_client import Gretel

gretel = Gretel(
    api_key="prompt",  # Will prompt for your API key
    endpoint="https://api.dev.gretel.ai"  # Gretel API endpoint
)

# Initialize Data Designer with the Apache-2.0 model suite
aidd = gretel.data_designer.new(model_suite="apache-2.0")

Found cached Gretel credentials
Logged in as kirit.thadaka@gretel.ai ✅
Gretel client configured to use project: proj_2uY0cfM0kjiegpyEZvCHNKZYxGf


## Defining Data Structures

Now we'll define the data models and evaluation rubrics for our product information dataset.

In [22]:
import string
from gretel_client.data_designer.params import Rubric
from pydantic import BaseModel, Field

# Define product information structure
class ProductInfo(BaseModel):
    product_name: str = Field(..., description="A realistic product name for the market.")
    key_features: list[str] = Field(..., min_length=1, max_length=3, description="Key product features.")
    description: str = Field(..., description="A short, engaging description of what the product does, highlighting a unique but believable feature.")
    price_usd: float = Field(..., description="The stated price in USD.")


# Define evaluation rubrics for answer quality
CompletenessRubric = Rubric(
    name="Completeness",
    description="Evaluation of AI assistant's thoroughness in addressing all aspects of the user's query.",
    scoring={
        "Complete": "The response thoroughly covers all key points requested in the question, providing sufficient detail to satisfy the user's information needs.",
        "PartiallyComplete": "The response addresses the core question but omits certain important details or fails to elaborate on relevant aspects that were requested.",
        "Incomplete": "The response significantly lacks necessary information, missing major components of what was asked and leaving the query largely unanswered.",
    }
)

AccuracyRubric = Rubric(
    name="Accuracy",
    description="Evaluation of how factually correct the AI assistant's response is relative to the product information.",
    scoring={
        "Accurate": "The information provided aligns perfectly with the product specifications without introducing any misleading or incorrect details.",
        "PartiallyAccurate": "While some information is correctly stated, the response contains minor factual errors or potentially misleading statements about the product.",
        "Inaccurate": "The response presents significantly wrong information about the product, with claims that contradict the actual product details.",
    }
)
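
To make the `ProductInfo` constraints concrete, here is a minimal sketch in plain Python (standard library only) that checks a record against the same field rules declared above: a string name, one to three string features, a string description, and a numeric price. The helper name `check_product_info` and the sample record are hypothetical and exist only for illustration; in the notebook, these rules are declared on the pydantic model rather than checked by hand.

```python
def check_product_info(record: dict) -> list[str]:
    """Return a list of constraint violations (empty if the record conforms)."""
    problems = []
    if not isinstance(record.get("product_name"), str):
        problems.append("product_name must be a string")

    features = record.get("key_features")
    if not (isinstance(features, list) and all(isinstance(f, str) for f in features)):
        problems.append("key_features must be a list of strings")
    elif not 1 <= len(features) <= 3:
        # Mirrors min_length=1, max_length=3 on the key_features field.
        problems.append("key_features must contain 1 to 3 items")

    if not isinstance(record.get("description"), str):
        problems.append("description must be a string")

    price = record.get("price_usd")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(price, bool) or not isinstance(price, (int, float)):
        problems.append("price_usd must be a number")
    return problems


# Hypothetical record for illustration; not taken from the generated dataset.
sample = {
    "product_name": "Example Desk Lamp",
    "key_features": ["Dimmable", "USB charging port"],
    "description": "A compact desk lamp with adjustable brightness.",
    "price_usd": 39.99,
}
print(check_product_info(sample))
```

A record with four entries in `key_features` would come back with a single violation, which is the kind of output the structured generation step is meant to avoid.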

## Data Generation Workflow

Now we'll configure the data generation workflow to create product information, questions, and answers.

In [23]:
# Define product category options
aidd.add_column(
    name="category",
    type="category",
    params={"values": ['Electronics', 'Clothing', 'Home Appliances', 'Groceries', 'Toiletries', 
                       'Sports Equipment', 'Toys', 'Books', 'Pet Supplies', 'Tools & Home Improvement', 
                       'Beauty', 'Health & Wellness', 'Outdoor Gear', 'Automotive', 'Jewelry', 
                       'Watches', 'Office Supplies', 'Gifts', 'Arts & Crafts', 'Baby & Kids', 
                       'Music', 'Video Games', 'Movies', 'Software', 'Tech Devices']}
)

# Define price range to seed realistic product types
aidd.add_column(
    name="price_tens_of_dollars",
    type="uniform",
    params={"low": 1, "high": 200},
    convert_to="int"
)

aidd.add_column(
    name="product_price",
    type="expression",
    expr="{{ ((price_tens_of_dollars * 10) - 0.01) | round(2) }}",
    dtype="float"
)

# Generate first letter for product name to ensure diversity
aidd.add_column(
    name="first_letter",
    type="category",
    params={"values": list(string.ascii_uppercase)}
)

# Generate product information
aidd.add_column(
    name="product_info",
    prompt="""\
Generate a realistic product description for a product in the {{ category }} category that costs {{ product_price }}.
The name of the product MUST start with the letter {{ first_letter }}.\
""",
    data_config={"type": "structured", "params": {"model": ProductInfo}}
)

# Generate user questions about the product
aidd.add_column(
    name="question",
    prompt="Ask a question about the following product:\n\n {{ product_info }}",
)

# Determine if this example will include hallucination
aidd.add_column(
    name="is_hallucination",
    type="bernoulli",
    params={"p": 0.5}
)

# Generate answers to the questions
aidd.add_column(
    name="answer",
    prompt="""\
{%- if is_hallucination == 0 -%}
<product_info>
{{ product_info }}
</product_info>

{% endif -%}
User Question: {{ question }}

Directly and succinctly answer the user's question.\
{%- if is_hallucination == 1 %} Make up whatever information you need to in order to answer the user's request.{% endif -%}
"""
)

# Evaluate answer quality
aidd.add_column(
    name="llm_answer_metrics",
    type="llm-judge",
    prompt="""\
<product_info>
{{ product_info }}
</product_info>

User Question: {{ question }}
AI Assistant Answer: {{ answer }}

Judge the AI assistant's response to the user's question about the product described in <product_info>.\
""",
    rubrics=[CompletenessRubric, AccuracyRubric]
)

# Extract metric scores for easier analysis
aidd.add_column(
    name="completeness_result",
    type="expression",
    expr="{{ llm_answer_metrics.Completeness.score }}"
)

aidd.add_column(
    name="accuracy_result",
    type="expression",
    expr="{{ llm_answer_metrics.Accuracy.score }}"
)
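
The logic of the workflow above can be traced for a single record in plain Python. This is a rough sketch, not the Data Designer implementation: `product_price` and `build_answer_prompt` are hypothetical helper names, `random.randint` stands in for the `uniform` sampler with `convert_to="int"`, and the product text and question are placeholders. It shows how the seed values become a price ending in .99, and how the `is_hallucination` flag decides whether the answering model sees the product details at all.

```python
import random


def product_price(price_tens_of_dollars: int) -> float:
    """Mirror the product_price expression: tens of dollars, minus one cent."""
    return round(price_tens_of_dollars * 10 - 0.01, 2)


def build_answer_prompt(product_info: str, question: str, is_hallucination: int) -> str:
    """Mirror the branching in the answer prompt template."""
    parts = []
    if is_hallucination == 0:
        # Grounded case: the model is shown the product details.
        parts.append(f"<product_info>\n{product_info}\n</product_info>\n")
    parts.append(f"User Question: {question}\n")
    instruction = "Directly and succinctly answer the user's question."
    if is_hallucination == 1:
        # Ungrounded case: no product details, and the model is told to improvise.
        instruction += (
            " Make up whatever information you need to in order to"
            " answer the user's request."
        )
    parts.append(instruction)
    return "\n".join(parts)


rng = random.Random(0)                  # fixed seed so the sketch is repeatable
price_tens = rng.randint(1, 200)        # stand-in for the uniform sampler
flag = int(rng.random() < 0.5)          # stand-in for the bernoulli sampler, p=0.5

print(product_price(price_tens))
print(build_answer_prompt("placeholder product text", "placeholder question", flag))
```

Because the hallucination branch withholds `<product_info>` entirely, the judge prompt (which always includes it) can compare the improvised answer against the real product details.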

## Generate the Preview

Let's examine a sample record to understand the generated data.

In [24]:
# Preview the generated data
outs = aidd.preview()

[12:15:56] [INFO] 🚀 Generating preview
[12:15:58] [INFO] 🦜 Step 1: Generate columns using samplers
[12:15:58] [INFO] 🦜 Step 2: Generate column from expression
[12:15:59] [INFO] 🦜 Step 3: Generate column from template
[12:16:14] [INFO] 🦜 Step 4: Generate column from template 1
[12:16:20] [INFO] 🦜 Step 5: Generate column from template 2
[12:16:33] [INFO] ⚖️ Step 6: Judge with llm
[12:16:48] [INFO] 🦜 Step 7: Generate column from expression 1
[12:16:49] [INFO] 🦜 Step 8: Generate column from expression 2
[12:16:49] [INFO] 🎉 Your dataset preview is ready!


In [25]:
outs.display_sample_record()

## Viewing the Dataset

We can view the entire preview dataset to understand the variety of products, questions, and answers generated.

In [26]:
outs.dataset.df

| | category | price_tens_of_dollars | first_letter | is_hallucination | product_price | product_info | question | answer | judged_by_llm | llm_answer_metrics | completeness_result | accuracy_result |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Toys | 88 | T | 1 | 879.99 | {'product_name': 'TecnoBuild Robotics Kit', 'k... | What age range is the TecnoBuild Robotics Kit ... | The TecnoBuild Robotics Kit is suitable for ch... | True | {'Completeness': {'reasoning': 'The AI assista... | Incomplete | Inaccurate |
| 1 | Software | 72 | J | 0 | 719.99 | {'product_name': 'Juno AI Content Generator', ... | How does the "Juno AI Content Generator" ensur... | The "Juno AI Content Generator" ensures the qu... | True | {'Completeness': {'reasoning': 'The response a... | PartiallyComplete | PartiallyAccurate |
| 2 | Software | 155 | O | 1 | 1549.99 | {'product_name': 'OmniTask Pro', 'key_features... | How does OmniTask Pro's AI-driven prioritizati... | OmniTask Pro's AI-driven prioritization system... | True | {'Completeness': {'reasoning': 'The response p... | Complete | PartiallyAccurate |
| 3 | Music | 160 | F | 1 | 1599.99 | {'product_name': 'FusionX Wireless Headphones'... | How does the Active Noise Cancellation feature... | The Active Noise Cancellation (ANC) feature in... | True | {'Completeness': {'reasoning': 'The response a... | PartiallyComplete | Inaccurate |
| 4 | Health & Wellness | 8 | F | 1 | 79.99 | {'product_name': 'FitnessSync Smart Watch', 'k... | How accurate is the heart rate monitoring feat... | The heart rate monitoring feature on the Fitne... | True | {'Completeness': {'reasoning': 'The AI assista... | Incomplete | Inaccurate |
| 5 | Baby & Kids | 149 | T | 1 | 1489.99 | {'product_name': 'Toddler Smart Learning Table... | How does the parental control settings on the ... | The Toddler Smart Learning Tablet's parental c... | True | {'Completeness': {'reasoning': 'The AI assista... | Complete | PartiallyAccurate |
| 6 | Music | 191 | Q | 1 | 1909.99 | {'product_name': 'Quantum Symphony Headphones'... | How does the Active Noise Cancellation feature... | The Active Noise Cancellation (ANC) feature in... | True | {'Completeness': {'reasoning': 'The response i... | PartiallyComplete | Inaccurate |
| 7 | Jewelry | 164 | Q | 0 | 1639.99 | {'product_name': 'Quintessa Eternal Radiance D... | How does the 1.5 carat center diamond in the Q... | The 1.5 carat center diamond in the Quintessa ... | True | {'Completeness': {'reasoning': 'The response a... | Complete | Accurate |
| 8 | Clothing | 99 | I | 1 | 989.99 | {'product_name': 'Iris Smart Jacket', 'key_fea... | How does the Iris Smart Jacket's integrated he... | The Iris Smart Jacket features an integrated h... | True | {'Completeness': {'reasoning': 'The AI assista... | Incomplete | Inaccurate |
| 9 | Watches | 140 | R | 1 | 1399.99 | {'product_name': 'RadianTech Smartwatch', 'key... | How does the RadianTech Smartwatch's heart rat... | The RadianTech Smartwatch's heart rate and sle... | True | {'Completeness': {'reasoning': 'The response t... | Complete | Accurate |
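
One quick sanity check on the preview is to tally the judge's accuracy labels against the `is_hallucination` flag. The sketch below does this with the standard library only, using the ten (flag, label) pairs copied by hand from the preview rows above; the variable names are illustrative. On the real DataFrame, `pandas.crosstab` over the same two columns would presumably give the equivalent table.

```python
from collections import Counter

# (is_hallucination, accuracy_result) pairs copied from the ten preview rows.
preview_rows = [
    (1, "Inaccurate"),
    (0, "PartiallyAccurate"),
    (1, "PartiallyAccurate"),
    (1, "Inaccurate"),
    (1, "Inaccurate"),
    (1, "PartiallyAccurate"),
    (1, "Inaccurate"),
    (0, "Accurate"),
    (1, "Inaccurate"),
    (1, "Accurate"),
]

# Count how often each (flag, label) combination occurs.
counts = Counter(preview_rows)
for (flag, label), n in sorted(counts.items()):
    print(f"is_hallucination={flag}  {label:<18} {n}")
```

In this preview, five of the eight rows flagged as hallucinations were judged Inaccurate. Ten rows is far too small a sample to draw conclusions from, so the same tally is more informative on the full dataset.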


## Generating the Full Dataset

Now that we've verified our data model looks good, let's generate a full dataset with 1,000 records.

In [27]:
# Run the job
workflow_run = aidd.create(num_records=1_000, workflow_run_name="product_qa_dataset")

[12:16:49] [INFO] 🚀 Submitting batch workflow
▶️ Creating Workflow: w_2vSYC8zobr2HdsEwCbGJDjrDZnV
▶️ Created Workflow Run: wr_2vSYCDqs0a6G3pjTYpHJZp29lDU
🔗 Workflow Run console link: https://console-dev.gretel.ai/workflows/w_2vSYC8zobr2HdsEwCbGJDjrDZnV/runs/wr_2vSYCDqs0a6G3pjTYpHJZp29lDU


In [33]:
run = gretel.workflows.get_workflow_run(workflow_run_id=workflow_run.id)

In [None]:
run.poll()