<a target="_parent" href="https://colab.research.google.com/github/gretelai/gretel-blueprints/blob/main/docs/notebooks/data-designer/data-designer-101/2-structured-outputs-and-jinja-expressions.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# 🎨 Data Designer 101: Structured Outputs and Jinja Expressions

In this notebook, we continue our exploration of `DataDesigner` and demonstrate more advanced data generation using structured outputs and Jinja expressions.


If this is your first time using `DataDesigner`, we recommend starting with the [first notebook](https://github.com/gretelai/gretel-blueprints/blob/main/docs/notebooks/data-designer/data-designer-101/1-the-basics.ipynb) in this 101 series.


<br>

### 💾 Install `gretel-client` and its dependencies

In [None]:
%%capture
%pip install -U gretel_client

In [1]:
from gretel_client.navigator_client import Gretel

# We import AIDD column and parameter types using this shorthand for convenience.
import gretel_client.data_designer.params as P
import gretel_client.data_designer.columns as C

# The Gretel object is the SDK's main entry point for interacting with Gretel's API.
gretel = Gretel(api_key="prompt")

Found cached Gretel credentials
Logged in as kirit.thadaka@gretel.ai ✅
Using project: default-sdk-project-1b613ec72030408
Project link: https://console-eng.gretel.ai/proj_2uY0cfM0kjiegpyEZvCHNKZYxGf


## 🧑‍🎨 Designing our data

- We will again create a product review dataset, but this time we will use structured outputs and Jinja expressions.

- Structured outputs let you specify the exact schema of the data you want to generate.

- `DataDesigner` supports schemas specified as either JSON Schema or Pydantic data models (recommended).

<br>

We'll define our structured outputs using Pydantic data models:

In [2]:
from decimal import Decimal
from typing import Literal
from pydantic import BaseModel, Field

# We define a Product schema so that the name, description, and price are generated 
# in one go, with the types and constraints specified.
class Product(BaseModel):
    name: str = Field(description="The name of the product")
    description: str = Field(description="A description of the product")
    price: Decimal = Field(description="The price of the product", ge=10, le=1000, decimal_places=2)

class ProductReview(BaseModel):
    rating: int = Field(description="The rating of the product", ge=1, le=5)
    customer_mood: Literal["irritated", "mad", "happy", "neutral", "excited"] = Field(description="The mood of the customer")
    review: str = Field(description="A review of the product")
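
To see what these constraints enforce, the sketch below validates a few hand-written payloads against the same `Product` model using plain Pydantic (the v2 API is assumed). The `is_valid_product` helper and the sample payloads are hypothetical and exist only for illustration; how `DataDesigner` reacts to invalid model output internally is not shown here.

```python
from decimal import Decimal

from pydantic import BaseModel, Field, ValidationError


class Product(BaseModel):
    name: str = Field(description="The name of the product")
    description: str = Field(description="A description of the product")
    price: Decimal = Field(description="The price of the product", ge=10, le=1000, decimal_places=2)


def is_valid_product(payload: dict) -> bool:
    """Return True if `payload` satisfies the Product schema (hypothetical helper)."""
    try:
        Product.model_validate(payload)
        return True
    except ValidationError:
        return False


# Hypothetical payloads, not output from the notebook.
good = {"name": "Desk Lamp", "description": "An adjustable LED lamp", "price": "49.99"}
too_cheap = {**good, "price": "5.00"}        # violates ge=10
too_precise = {**good, "price": "49.999"}    # violates decimal_places=2

for payload in (good, too_cheap, too_precise):
    print(payload["price"], is_valid_product(payload))

# The same model can also be exported as a JSON Schema dictionary.
schema = Product.model_json_schema()
print(sorted(schema["properties"]))
```

Keeping the range and precision rules on the model means the schema, rather than the prompt text alone, carries the constraints.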

Next, let's design our product review dataset using a few more tricks compared to the previous notebook:

In [3]:
aidd = gretel.data_designer.new(model_suite="apache-2.0")

# Since we often just want a few attributes from Person objects, we can use 
# DataDesigner's `with_person_samplers` method to create multiple person samplers 
# at once and drop the person object columns from the final dataset.
aidd.with_person_samplers({"customer": P.PersonSamplerParams(age_range=[18, 65])})

aidd.add_column(
    C.SamplerColumn(
        name="product_category", 
        type=P.SamplerType.CATEGORY,
        params=P.CategorySamplerParams(
                values=["Electronics", "Clothing", "Home & Kitchen", "Books", "Home Office"], 
            )
    )
)

aidd.add_column(    
    C.SamplerColumn(
        name="product_subcategory",
        type=P.SamplerType.SUBCATEGORY,
        params=P.SubcategorySamplerParams(
            category="product_category",  
            values={
                "Electronics": ["Smartphones", "Laptops", "Headphones", "Cameras", "Accessories"],
                "Clothing": ["Men's Clothing", "Women's Clothing", "Winter Coats", "Activewear", "Accessories"],
                "Home & Kitchen": ["Appliances", "Cookware", "Furniture", "Decor", "Organization"],
                "Books": ["Fiction", "Non-Fiction", "Self-Help", "Textbooks", "Classics"],
                "Home Office": ["Desks", "Chairs", "Storage", "Office Supplies", "Lighting"]
            }
        )
    )
)

aidd.add_column(
    C.SamplerColumn(
        name="target_age_range",
        type=P.SamplerType.CATEGORY,
        params=P.CategorySamplerParams(values=["18-25", "25-35", "35-50", "50-65", "65+"])
    )
)

aidd.add_column(
    C.SamplerColumn(
        name="review_style",
        type=P.SamplerType.CATEGORY,
        params=P.CategorySamplerParams(
            values=["rambling", "brief", "detailed", "structured with bullet points"],
            weights=[1, 2, 2, 1]
        )
    )
)

# We can create new columns using Jinja expressions that reference 
# existing columns, including attributes of nested objects.
aidd.add_column(
    C.ExpressionColumn(
        name="customer_name",
        expr="{{ customer.first_name }} {{ customer.last_name }}"
    )
)

aidd.add_column(
    C.ExpressionColumn(
        name="customer_age",
        expr="{{ customer.age }}"
    )
)

# Add an `LLMStructuredColumn` to generate structured outputs.
aidd.add_column(
    C.LLMStructuredColumn(
        name="product",
        prompt=(
            "Create a product in the '{{ product_category }}' category, focusing on products "
            "related to '{{ product_subcategory }}'. The target age range of the ideal customer is "
            "{{ target_age_range }} years old. The product should be priced between $10 and $1000."
        ),
        output_format=Product
    )
)

aidd.add_column(
    C.LLMStructuredColumn(
        name="customer_review",
        prompt=(
            "Your task is to write a review for the following product:\n\n"
            "Product Name: {{ product.name }}\n"
            "Product Description: {{ product.description }}\n"
            "Price: {{ product.price }}\n\n"
            "Imagine your name is {{ customer_name }} and you are from {{ customer.city }}, {{ customer.state }}. "
            "Write the review in a style that is '{{ review_style }}'."
        ),
        output_format=ProductReview
    )
)

# Let's add an evaluation report to our dataset.
aidd.with_evaluation_report().validate()

[12:41:19] [INFO] Validation passed ✅
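
The Jinja expressions used in the cell above can be tried out in isolation. The following is a minimal sketch that renders the same templates with the `jinja2` library directly; the `render` helper and the sample record are hypothetical, and `DataDesigner` may configure its own rendering differently.

```python
from jinja2 import Template

# Hypothetical record; in the notebook these values come from the samplers.
record = {
    "customer": {
        "first_name": "Jane",
        "last_name": "Doe",
        "age": 36,
        "city": "Austin",
        "state": "TX",
    },
    "review_style": "brief",
}


def render(expr: str, row: dict) -> str:
    """Render a Jinja expression against one row of column values."""
    return Template(expr).render(**row)


# Dotted access reaches into nested objects, just like in the column definitions.
customer_name = render("{{ customer.first_name }} {{ customer.last_name }}", record)
customer_age = render("{{ customer.age }}", record)

# Derived columns can be referenced by later templates once they are in the row.
row = {**record, "customer_name": customer_name}
prompt_tail = render(
    "Imagine your name is {{ customer_name }} and you are from "
    "{{ customer.city }}, {{ customer.state }}. "
    "Write the review in a style that is '{{ review_style }}'.",
    row,
)

print(customer_name)
print(customer_age)
print(prompt_tail)
```

Plain Jinja always renders to a string, so `customer_age` here is the text `"36"`; whether `DataDesigner` converts such values back to numbers is not something this sketch demonstrates.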


## 👀 Preview the dataset

- Iteration is key to generating high-quality synthetic data.

- Use the `preview` method to generate 10 records for inspection.

- Setting `verbose_logging=True` prints logs within each task of the generation process.

In [4]:
preview = aidd.preview(verbose_logging=True)

[12:41:24] [INFO] 🚀 Generating preview
[12:41:25] [INFO] ⛓️ Representing generation steps as a Directed Acyclic Graph
[12:41:25] [INFO]   |-- 🔗 `customer_review` depends on `customer_name`
[12:41:25] [INFO]   |-- 🔗 `customer_review` depends on `product`
[12:41:26] [INFO] 🎲 Step 1: Using samplers to generate 5 columns
[12:41:26] [INFO]   |-- 🎲 👩‍🔬 Creating person generator
[12:41:26] [INFO]   |-- 🎲 Using numerical samplers to generate 10 records across 5 columns
[12:41:31] [INFO] 🦜 Step 2: Generating structured column `product`
[12:41:31] [INFO]   |-- 📝 Preparing template to generate data column `product`
[12:41:31] [INFO]   |   |-- model_alias: ModelAlias.STRUCTURED
[12:41:35] [INFO]   |-- Model usage: [{"model": "gretel/Qwen/Qwen2.5-Coder-32B-Instruct", "prompt_tokens": 2538, "completion_tokens": 808, "request_count": 10, "total_tokens": 3346}]
[12:41:36] [INFO] 💬 Step 3: Rendering expression column `customer_name`
[12:41:36] [INFO]   |-- 🧩 Generating column `customer_name` from expre
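
The log above reports that generation steps are represented as a directed acyclic graph. As a rough illustration of why that matters, the sketch below orders the columns with Python's standard `graphlib` module so that every column is generated after the columns it references. The dependency map is read off the column definitions in this notebook; how `DataDesigner` builds and schedules its own graph is an assumption.

```python
from graphlib import TopologicalSorter

# Map each column to the columns it depends on. The `customer_name` and
# `product` dependencies of `customer_review` appear in the log above; the
# rest are read from the Jinja references and sampler parameters.
dependencies = {
    "customer_review": {"customer_name", "product", "review_style", "customer"},
    "customer_name": {"customer"},
    "customer_age": {"customer"},
    "product": {"product_category", "product_subcategory", "target_age_range"},
    "product_subcategory": {"product_category"},
}

# static_order() yields every node after all of its dependencies and
# raises graphlib.CycleError if the references form a cycle.
order = list(TopologicalSorter(dependencies).static_order())

for step, column in enumerate(order, start=1):
    print(step, column)
```

Any order produced this way places `product` and `customer_name` before `customer_review`, which matches the step sequence in the log.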

In [5]:
# The preview dataset is available as a pandas DataFrame.
preview.dataset.df.head()

| | product_category | product_subcategory | target_age_range | review_style | product | customer_name | customer_age | customer_review |
|---|---|---|---|---|---|---|---|---|
| 0 | Clothing | Women's Clothing | 35-50 | brief | {"name": "Elegant Silk Blouse", "description":... | Tina Harville | 43 | {"rating": 4, "customer_mood": "happy", "revie... |
| 1 | Home Office | Office Supplies | 18-25 | brief | {"name": "Ergonomic Standing Desk Converter", ... | Jennifer Freeman | 42 | {"rating": 5, "customer_mood": "excited", "rev... |
| 2 | Electronics | Accessories | 35-50 | brief | {"name": "Ergonomic Mousepad for Professional ... | Mark Schockling | 19 | {"rating": 5, "customer_mood": "happy", "revie... |
| 3 | Home Office | Office Supplies | 35-50 | brief | {"name": "Ergonomic Standing Desk Converter", ... | Debbie Bennett | 59 | {"rating": 4, "customer_mood": "happy", "revie... |
| 4 | Clothing | Accessories | 65+ | brief | {"name": "Comfortable Leather Bracelet", "desc... | John Taylor | 38 | {"rating": 4, "customer_mood": "happy", "revie... |
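
The `product` and `customer_review` cells in the preview above look like JSON objects. Assuming they arrive as JSON strings (the sketch also accepts already-parsed dictionaries), they can be expanded into flat fields with the standard `json` module. The `flatten_structured` helper and the sample row are hypothetical and are not part of the `gretel_client` API.

```python
import json

# Hypothetical row shaped like the preview above; the values are made up.
row = {
    "review_style": "brief",
    "product": '{"name": "Desk Lamp", "description": "An adjustable LED lamp", "price": "49.99"}',
    "customer_review": '{"rating": 4, "customer_mood": "happy", "review": "Bright and sturdy."}',
}


def flatten_structured(row: dict, columns: list[str]) -> dict:
    """Expand structured columns into `<column>_<field>` keys."""
    flat = {key: value for key, value in row.items() if key not in columns}
    for column in columns:
        value = row[column]
        # Accept both JSON strings and already-parsed dictionaries.
        parsed = json.loads(value) if isinstance(value, str) else value
        for field, field_value in parsed.items():
            flat[f"{column}_{field}"] = field_value
    return flat


flat = flatten_structured(row, ["product", "customer_review"])
for key, value in flat.items():
    print(f"{key}: {value}")
```

Flattening like this makes it easier to filter or aggregate on individual fields such as the rating or the customer mood.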


In [6]:
# Run this cell multiple times to cycle through the 10 preview records.
preview.display_sample_record()

## 🆙 Scale up!

- Once you are happy with the preview, scale up to a larger dataset by submitting a batch workflow.

- Setting `wait_until_done=True` will block until the workflow is complete.

- You can view the evaluation report by following the workflow link in the output of `create` below.

In [7]:
# This will take 5-10 minutes to complete.
workflow_run = aidd.create(
    num_records=100, 
    name="aidd-101-notebook-2-product-reviews",
    wait_until_done=True
)

[12:46:08] [INFO] 🚀 Submitting batch workflow
▶️ Creating Workflow: w_2wBnji97vLWXVY3YGQEzMetY34K
▶️ Created Workflow Run: wr_2wBnjrm603VhK5QDPmHNlQqSeV1
🔗 Workflow Run console link: https://console-dev.gretel.ai/workflows/w_2wBnji97vLWXVY3YGQEzMetY34K/runs/wr_2wBnjrm603VhK5QDPmHNlQqSeV1
Fetching task logs for workflow run wr_2wBnjrm603VhK5QDPmHNlQqSeV1
Workflow run is now in status: RUN_STATUS_CREATED
Got task wt_2wBnjpOeC805kjrLcrpRzOlHKft
Workflow run is now in status: RUN_STATUS_ACTIVE
[using-samplers-to-generate-5-columns] Task Status is now: RUN_STATUS_ACTIVE
[using-samplers-to-generate-5-columns] 2025-04-24 19:49:57.847603+00:00 Preparing step 'using-samplers-to-generate-5-columns'
[using-samplers-to-generate-5-columns] 2025-04-24 19:50:10.010225+00:00 Starting 'generate_columns_using_samplers' task execution
[using-samplers-to-generate-5-columns] 2025-04-24 19:50:10.011611+00:00 🎲 👨‍🍳 Creating person generator
[using-samplers-to-generate-5-columns] 2025-04-24 19:50:28.939749+00

In [None]:
# The generated dataset is available as a pandas DataFrame.
workflow_run.dataset.df.head()

In [7]:
# Fetch the evaluation report from the workflow run.
report = workflow_run.report

# If running in Colab:
report.display_in_notebook()

# If running locally, we recommend displaying in the browser.
# report.display_in_browser()