<a target="_parent" href="https://colab.research.google.com/github/gretelai/gretel-blueprints/blob/main/docs/notebooks/demo/navigator/getting-started/data-designer-101.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# 🎨 Data Designer 101: A Comprehensive Guide

Welcome to this comprehensive introduction to Gretel's Data Designer! This notebook will walk you through the essential concepts and techniques you need to generate high-quality synthetic data for your projects.

## What is Data Designer?

Data Designer is a tool in Gretel's ecosystem that lets you programmatically define and generate synthetic data with precise control over structure, relationships, and statistical properties. Whether you need test data for development, synthetic data for privacy protection, or training data for AI models, Data Designer gives you a flexible way to build it.

## What You'll Learn

By the end of this tutorial, you'll be able to:
- Create synthetic datasets with various column types
- Define relationships between columns using expressions
- Generate realistic person data with demographic information
- Use statistical distributions to create realistic numeric data
- Leverage LLMs to generate contextual text data
- Add constraints to ensure data validity
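- Preview and generate your final dataset
- Start from existing seed data and enrich it with synthetic columns
- Produce structured LLM outputs with Pydantic models or JSON Schema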

Let's get started!

## Installation and Setup

First, let's install the necessary packages and set up our environment.

In [None]:
%%capture
# Install the latest version of Gretel client and dependencies
%pip install -U gretel_client 

In [21]:
# Import necessary libraries
import pandas as pd
from datetime import datetime, timedelta

from gretel_client.navigator_client import Gretel
from gretel_client.data_designer import columns as C
from gretel_client.data_designer import params as P

# Initialize the Gretel client
gretel = Gretel(
    api_key="prompt",  # This will prompt for your API key
    endpoint="https://api.dev.gretel.ai"  # Update with your endpoint if different
)

# Create a new Data Designer object
model_suite = "apache-2.0"  # This specifies the model suite to use
dd = gretel.data_designer.new(model_suite=model_suite)

Found cached Gretel credentials
Logged in as kirit.thadaka@gretel.ai ✅
Gretel client configured to use project: proj_2uY0cfM0kjiegpyEZvCHNKZYxGf


## Part 1: Understanding Column Types

Data Designer offers several types of columns to generate different kinds of data. Let's explore each of these types:

1. **Sampler Columns** - Generate data from predefined distributions or categories
2. **Expression Columns** - Create data using Jinja expressions that can reference other columns
3. **LLM-Generated Columns** - Use large language models to create contextual text

### Sampler Columns

Let's start with some basic sampler columns. These are useful for generating structured data from predefined sources or distributions.

In [22]:
# Category sampler - useful for discrete categories like product types, statuses, etc.
dd.add_column(
    C.SamplerColumn(
        name="product_category",
        type=P.SamplingSourceType.CATEGORY,
        params=P.CategorySamplerParams(
            values=["Electronics", "Clothing", "Home & Kitchen", "Books", "Toys"],
            weights=[0.3, 0.25, 0.2, 0.15, 0.1]  # Optional: control the distribution
        )
    )
)

# Numerical samplers - for generating numbers from statistical distributions
dd.add_column(
    C.SamplerColumn(
        name="price",
        type=P.SamplingSourceType.GAUSSIAN,  # Normal distribution
        params=P.GaussianSamplerParams(
            mean=100.0,
            stddev=30.0,
            min=10.0,  # Ensure no negative prices
            max=500.0  # Cap the maximum price
        ),
        convert_to="float"  # Specify the output type
    )
)

dd.add_column(
    C.SamplerColumn(
        name="quantity_in_stock",
        type=P.SamplingSourceType.POISSON,  # Good for count data
        params=P.PoissonSamplerParams(mean=50)
    )
)

# DateTime samplers - for generating dates and times
dd.add_column(
    C.SamplerColumn(
        name="listed_date",
        type=P.SamplingSourceType.DATETIME,
        params=P.DatetimeSamplerParams(
            start="2023-01-01",
            end="2023-12-31"
        ),
        convert_to="%Y-%m-%d"  # Format as YYYY-MM-DD
    )
)

# UUID sampler - for generating unique identifiers
dd.add_column(
    C.SamplerColumn(
        name="product_id",
        type=P.SamplingSourceType.UUID,
        params=P.UUIDParams(
            prefix="PROD-",  # Add a prefix to the UUID
            short_form=True  # Generate shorter UUIDs for readability
        )
    )
)

# Let's preview what we have so far
dd.validate()
preview = dd.preview()
preview.dataset.df[["product_id", "product_category", "price", "quantity_in_stock", "listed_date"]]

[14:23:33] [INFO] Validation passed ✅
[14:23:33] [INFO] 🚀 Generating preview
[14:23:35] [INFO] 🦜 Step 1: Generate columns using samplers
[14:23:37] [INFO] 🎉 Your dataset preview is ready!
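
To build intuition for what the `weights` argument of a category sampler means, here is a minimal local sketch that uses only Python's standard library. It is not how Data Designer implements sampling; `sample_categories` is a hypothetical helper written for this illustration, and the seed is fixed only to make the run reproducible.

```python
import random
from collections import Counter

values = ["Electronics", "Clothing", "Home & Kitchen", "Books", "Toys"]
weights = [0.3, 0.25, 0.2, 0.15, 0.1]


def sample_categories(values, weights, n, seed=0):
    """Hypothetical helper: draw n categories with the given relative weights."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return rng.choices(values, weights=weights, k=n)


samples = sample_categories(values, weights, n=10_000)
frequencies = {value: count / len(samples) for value, count in Counter(samples).items()}

# With enough samples, observed frequencies should land close to the requested weights
for value, weight in zip(values, weights):
    print(f"{value:15s} requested={weight:.2f} observed={frequencies[value]:.2f}")
```

With only a handful of preview records, the observed mix can differ noticeably from the weights; the match improves as the number of records grows.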

### Subcategory Sampler

Now let's explore a more advanced sampler: the subcategory sampler. This allows you to create hierarchical relationships between categorical values.

In [None]:
# Subcategory sampler - create relationships between categories
dd.add_column(
    C.SamplerColumn(
        name="product_subcategory",
        type=P.SamplingSourceType.SUBCATEGORY,
        params=P.SubcategoryParams(
            category="product_category",  # Parent column
            values={
                "Electronics": ["Smartphones", "Laptops", "Headphones", "Cameras", "Accessories"],
                "Clothing": ["Men's", "Women's", "Children's", "Activewear", "Accessories"],
                "Home & Kitchen": ["Appliances", "Cookware", "Furniture", "Decor", "Organization"],
                "Books": ["Fiction", "Non-Fiction", "Children's", "Textbooks", "Reference"],
                "Toys": ["Action Figures", "Board Games", "Educational", "Outdoor", "Plush"]
            }
        )
    )
)

# Preview to see how subcategories relate to categories
dd.validate()
preview = dd.preview()
preview.dataset.df[["product_category", "product_subcategory"]]
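
Conceptually, a subcategory sampler picks the parent value first and then draws the child from that parent's list. The sketch below illustrates the idea with plain Python; it assumes uniform choice at both levels and uses a hypothetical `sample_product` helper, so treat it as a mental model and not as Data Designer's implementation.

```python
import random

# A trimmed-down version of the category -> subcategory mapping
subcategories = {
    "Electronics": ["Smartphones", "Laptops", "Headphones"],
    "Books": ["Fiction", "Non-Fiction", "Textbooks"],
}


def sample_product(rng):
    """Hypothetical helper: pick a parent category, then a child from that parent only."""
    category = rng.choice(list(subcategories))
    subcategory = rng.choice(subcategories[category])
    return {"product_category": category, "product_subcategory": subcategory}


rng = random.Random(42)  # seeded so the sketch is reproducible
rows = [sample_product(rng) for _ in range(5)]
for row in rows:
    print(row)
```

Because the child is always drawn from the parent's own list, a combination such as "Books / Laptops" can never appear.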

### Expression Columns

Expression columns allow you to define values based on other columns, using Jinja templates. This is powerful for creating logical relationships in your data.

In [None]:
# Simple expression - combining values from other columns
dd.add_column(
    C.ExpressionColumn(
        name="product_title",
        expr="{{ product_subcategory }} {{ product_category }} (ID: {{ product_id }})"
    )
)

# Expressions with calculations
dd.add_column(
    C.ExpressionColumn(
        name="total_value",
        expr="{{ (price * quantity_in_stock) | round(2) }}"
    )
)

# Expressions with conditional logic
dd.add_column(
    C.ExpressionColumn(
        name="stock_status",
        expr="{% if quantity_in_stock == 0 %}Out of Stock{% elif quantity_in_stock < 10 %}Low Stock{% else %}In Stock{% endif %}"
    )
)

# Preview the expressions
preview = dd.preview()
preview.dataset.df[["product_title", "quantity_in_stock", "total_value", "stock_status"]]
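
One detail worth knowing when an expression mixes arithmetic with a filter: in Jinja, a filter binds more tightly than `*` or `+`, so it applies only to the value right next to it unless you add parentheses. The sketch below shows this locally with the `jinja2` package (assumed to be installed); Data Designer's own rendering environment may differ in other respects.

```python
from jinja2 import Template

context = {"price": 19.999, "quantity_in_stock": 3}

# Without parentheses, round(2) applies only to quantity_in_stock,
# so the product itself is left unrounded
ungrouped = Template("{{ price * quantity_in_stock | round(2) }}").render(**context)

# With parentheses, round(2) applies to the whole product
grouped = Template("{{ (price * quantity_in_stock) | round(2) }}").render(**context)

print("ungrouped:", ungrouped)
print("grouped:  ", grouped)  # grouped:   60.0
```

Wrapping the arithmetic in parentheses before applying a filter is a cheap habit that avoids this class of surprise.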

## Part 2: Generating Person Data

Data Designer includes powerful capabilities for generating realistic person data. Let's explore these features.

In [None]:
# Generate customer data using the person sampler
dd.add_column(
    C.SamplerColumn(
        name="customer",  # This creates a nested object with all person attributes
        type=P.SamplingSourceType.PERSON,
        params=P.PersonSamplerParams(
            locale="en_US",  # Set the locale for appropriate formatting
            age_range=[18, 80]  # Minimum and maximum age
        )
    )
)

# Extract specific attributes from the customer object for easier use
dd.add_column(
    C.ExpressionColumn(
        name="customer_name",
        expr="{{ customer.first_name }} {{ customer.last_name }}"
    )
)

dd.add_column(
    C.ExpressionColumn(
        name="customer_email",
        expr="{{ customer.email_address }}"
    )
)

dd.add_column(
    C.ExpressionColumn(
        name="customer_location",
        expr="{{ customer.city }}, {{ customer.state }}"
    )
)

# Let's add a second person for shipping recipient (could be self or other)
dd.add_column(
    C.SamplerColumn(
        name="ship_to_self",
        type=P.SamplingSourceType.BERNOULLI,  # Boolean with probability
        params=P.BernoulliSamplerParams(p=0.7)  # 70% chance to ship to self
    )
)

dd.add_column(
    C.SamplerColumn(
        name="recipient",  # Second person
        type=P.SamplingSourceType.PERSON,
        params=P.PersonSamplerParams(
            locale="en_US",
            age_range=[18, 80]
        )
    )
)

# Use conditional expressions to determine shipping details
dd.add_column(
    C.ExpressionColumn(
        name="shipping_name",
        expr="{% if ship_to_self %}{{ customer.first_name }} {{ customer.last_name }}{% else %}{{ recipient.first_name }} {{ recipient.last_name }}{% endif %}"
    )
)

dd.add_column(
    C.ExpressionColumn(
        name="shipping_address",
        expr="{% if ship_to_self %}{{ customer.street_number }} {{ customer.street_name }}, {{ customer.city }}, {{ customer.state }} {{ customer.zipcode }}{% else %}{{ recipient.street_number }} {{ recipient.street_name }}, {{ recipient.city }}, {{ recipient.state }} {{ recipient.zipcode }}{% endif %}"
    )
)

# Preview customer and shipping information
dd.validate()
preview = dd.preview()
preview.dataset.df[["customer_name", "customer_email", "shipping_name", "shipping_address", "ship_to_self"]]

## Part 3: Order and Transaction Data

Now let's combine what we've learned to create a more complex dataset with order and transaction information.

In [None]:
# Generate an order ID
dd.add_column(
    C.SamplerColumn(
        name="order_id",
        type=P.SamplingSourceType.UUID,
        params=P.UUIDParams(
            prefix="ORD-",
            short_form=True,
            uppercase=True
        )
    )
)

# Order date (after product listing date)
dd.add_column(
    C.SamplerColumn(
        name="order_date",
        type=P.SamplingSourceType.TIMEDELTA,
        params=P.TimeDeltaParams(
            dt_min=1,     # Minimum days after listing date
            dt_max=90,    # Maximum days after listing date
            reference_column_name="listed_date", # Reference date
            unit="D"      # Unit is days
        ),
        convert_to="%Y-%m-%d"
    )
)

# Order quantity
dd.add_column(
    C.SamplerColumn(
        name="order_quantity",
        type=P.SamplingSourceType.UNIFORM,
        params=P.UniformSamplerParams(low=1, high=5)
    )
)

# Calculate total order amount
dd.add_column(
    C.ExpressionColumn(
        name="order_total",
        expr="{{ (price * order_quantity) | round(2) }}"
    )
)

# Shipping cost based on order total
dd.add_column(
    C.ExpressionColumn(
        name="shipping_cost",
        expr="{% if order_total > 50 %}0.00{% else %}5.99{% endif %}"
    )
)

# Calculate final amount
dd.add_column(
    C.ExpressionColumn(
        name="final_amount",
        expr="{{ ((order_total | float) + (shipping_cost | float)) | round(2) }}"
    )
)

# Generate payment method
dd.add_column(
    C.SamplerColumn(
        name="payment_method",
        type=P.SamplingSourceType.CATEGORY,
        params=P.CategorySamplerParams(
            values=["Credit Card", "Debit Card", "PayPal", "Apple Pay", "Google Pay"],
            weights=[0.4, 0.3, 0.15, 0.1, 0.05]
        )
    )
)

# Order status
dd.add_column(
    C.SamplerColumn(
        name="order_status",
        type=P.SamplingSourceType.CATEGORY,
        params=P.CategorySamplerParams(
            values=["Delivered", "Shipped", "Processing", "Cancelled"],
            weights=[0.6, 0.2, 0.15, 0.05]
        )
    )
)

# Preview order information
dd.validate()
preview = dd.preview()
preview.dataset.df[["order_id", "product_title", "order_quantity", "order_total", "shipping_cost", "final_amount", "order_status"]]
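
The time-delta idea used for `order_date` can be pictured as "reference date plus a random offset". Here is a small standard-library sketch of that idea. The helper name `sample_order_date` is hypothetical, and treating both bounds as inclusive is an assumption made for the illustration, not a statement about how Data Designer handles `dt_min` and `dt_max`.

```python
import random
from datetime import date, timedelta


def sample_order_date(listed_date, dt_min, dt_max, rng):
    """Hypothetical helper: offset listed_date by a random number of days."""
    offset_days = rng.randint(dt_min, dt_max)  # inclusive bounds are an assumption
    return listed_date + timedelta(days=offset_days)


rng = random.Random(7)  # seeded so the sketch is reproducible
listed = date(2023, 3, 15)
order_dates = [sample_order_date(listed, 1, 90, rng) for _ in range(5)]

for order_date in order_dates:
    print(listed.isoformat(), "->", order_date.strftime("%Y-%m-%d"))
```

Anchoring the offset to a reference column is what guarantees that every order date falls after its own listing date, row by row.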

## Part 4: LLM-Generated Columns

One of the most powerful features of Data Designer is the ability to use large language models (LLMs) to generate text based on the context. Let's use this to create product descriptions and customer reviews.

In [None]:
# Generate product descriptions using LLM
dd.add_column(
    C.LLMGenColumn(
        name="product_description",
        prompt=(
            "Write a concise 2-3 sentence product description for a {{ product_subcategory }} in the {{ product_category }} "
            "category. The product costs ${{ price }}. Make it informative and appealing to potential customers. "
            "Do not use bullet points."
        )
    )
)

# Generate customer reviews using LLM (conditionally based on order status)
dd.add_column(
    C.LLMGenColumn(
        name="customer_review",
        prompt=(
            "{% if order_status == 'Delivered' %}"
            "Write a brief customer review (1-2 sentences) from {{ customer_name }} about their purchase of a "
            "{{ product_subcategory }} in the {{ product_category }} category. The customer paid ${{ price }} per item "
            "and ordered {{ order_quantity }} units. "
            "{% if price > 100 %}The product is in the premium range.{% else %}The product is in the standard range.{% endif %} "
            "{% if order_quantity > 1 %}The customer bought multiple units.{% endif %} "
            "Give the review a rating out of 5 stars."
            "{% else %}"
            "No review available yet."
            "{% endif %}"
        )
    )
)

# Generate order notes using LLM with different conditions
dd.add_column(
    C.LLMGenColumn(
        name="order_notes",
        prompt=(
            "{% if order_status == 'Cancelled' %}"
            "Write a brief note (1 sentence) explaining why customer {{ customer_name }} cancelled their order "
            "for a {{ product_subcategory }}."
            "{% elif order_status == 'Processing' %}"
            "Write a brief processing note (1 sentence) for order {{ order_id }} placed by {{ customer_name }}."
            "{% elif order_status == 'Shipped' %}"
            "Write a brief shipping note (1 sentence) for order {{ order_id }} being delivered to {{ shipping_name }} "
            "at {{ shipping_address }}."
            "{% else %}"
            "Order {{ order_id }} was successfully delivered to {{ shipping_name }}."
            "{% endif %}"
        )
    )
)

# Preview LLM-generated content
dd.validate()
preview = dd.preview()
preview.dataset.df[["product_title", "product_description", "customer_review", "order_notes"]]

## Part 5: Adding Constraints

To ensure our data makes logical sense, we can add constraints that enforce relationships between columns.

In [None]:
# Ensure order quantity can't exceed quantity in stock
dd.add_constraint(
    target_column="order_quantity",
    type="column_inequality",
    params={
        "operator": "le",  # less than or equal to
        "rhs": "quantity_in_stock"
    }
)

# Ensure order date is after listing date
dd.add_constraint(
    target_column="order_date",
    type="column_inequality",
    params={
        "operator": "ge",  # greater than or equal to
        "rhs": "listed_date"
    }
)

# Constrain price to be positive
dd.add_constraint(
    target_column="price",
    type="scalar_inequality",
    params={
        "operator": "gt",  # greater than
        "rhs": 0
    }
)
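
If you'd like to double-check generated records against rules like these after the fact, a small checker is easy to write. The sketch below is a post-hoc validation helper in plain Python, not the mechanism Data Designer uses to enforce constraints. The `violations` function and the sample records are hypothetical; the operator names mirror the `"le"`, `"ge"`, and `"gt"` strings used above.

```python
import operator

# Map operator names to Python comparison functions
OPERATORS = {"lt": operator.lt, "le": operator.le, "gt": operator.gt, "ge": operator.ge}


def violations(records, target_column, op_name, rhs):
    """Return the records for which `target_column <op> rhs` does not hold.

    rhs is treated as a column name when it is a string, otherwise as a scalar.
    """
    compare = OPERATORS[op_name]
    bad = []
    for record in records:
        right = record[rhs] if isinstance(rhs, str) else rhs
        if not compare(record[target_column], right):
            bad.append(record)
    return bad


records = [
    {"order_quantity": 2, "quantity_in_stock": 40, "price": 99.5},
    {"order_quantity": 5, "quantity_in_stock": 3, "price": 12.0},
]

print(violations(records, "order_quantity", "le", "quantity_in_stock"))  # second record only
print(violations(records, "price", "gt", 0))  # []
```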

## Part 6: Generating the Final Dataset

Now that we've designed our schema, let's generate a larger dataset and save it.

In [None]:
# Preview a single record in more detail
dd.validate()
preview = dd.preview()
preview.display_sample_record()  # This shows a nicely formatted single record

In [None]:
# Define a name for our dataset
workflow_name = "synthetic-ecommerce-data"

# Generate a dataset with 100 records
workflow_run = dd.create(
    num_records=100,  # Generate 100 records
    name=workflow_name,
    wait_for_completion=True  # Wait until generation is complete
)

print(f"Generated dataset shape: {workflow_run.dataset.df.shape}")

In [None]:
# Display a sample of the generated dataset
workflow_run.dataset.df.head()

In [None]:
# Save the dataset to a CSV file
csv_filename = f"{workflow_name}.csv"
workflow_run.dataset.df.to_csv(csv_filename, index=False)
print(f"Dataset saved to {csv_filename}")

## Part 7: Working with Seed Data

Data Designer allows you to jumpstart your synthetic data generation by using existing data as a foundation. This powerful capability, known as "seeding," lets you:

- Use real data samples as templates for synthetic generation
- Create variations of existing datasets while maintaining their core structure

In [None]:
# Example 1: Using an in-memory DataFrame as seed data
import pandas as pd

# Create a sample DataFrame (in a real scenario, this could be your existing data)
seed_data = pd.DataFrame({
    'product_name': ['Premium Wireless Headphones', 'Smart Fitness Tracker', 'Portable Bluetooth Speaker', 'Ultra HD Monitor'],
    'manufacturer': ['AudioTech', 'FitLife', 'SoundWave', 'VisualPro'],
    'base_price': [129.99, 89.99, 49.99, 249.99],
    'rating': [4.7, 4.2, 4.5, 4.8]
})

print("Original seed data shape:", seed_data.shape)
display(seed_data)

# Create a new Data Designer instance
seed_dd = gretel.data_designer.new(model_suite=model_suite)

# Add the seed dataset to Data Designer
seed_dd.with_seed_dataset(
    seed_data,                    # Your DataFrame
    sampling_strategy="shuffle",  # Options: "shuffle", "sequential", or "weighted"
    with_replacement=True,        # Whether to allow sampling the same row multiple times
)
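
To see why sampling with replacement matters when the seed is small, here is a minimal standard-library sketch. `sample_seed_rows` is a hypothetical helper; raising an error when too many rows are requested without replacement is a choice made for this illustration and may not match what Data Designer does in that situation.

```python
import random

seed_rows = [
    {"product_name": "Premium Wireless Headphones", "base_price": 129.99},
    {"product_name": "Smart Fitness Tracker", "base_price": 89.99},
    {"product_name": "Portable Bluetooth Speaker", "base_price": 49.99},
    {"product_name": "Ultra HD Monitor", "base_price": 249.99},
]


def sample_seed_rows(rows, n, with_replacement, rng):
    """Hypothetical helper: draw n seed rows, optionally reusing rows."""
    if with_replacement:
        return rng.choices(rows, k=n)  # the same row may be drawn more than once
    if n > len(rows):
        raise ValueError("cannot draw more rows than the seed contains without replacement")
    return rng.sample(rows, k=n)  # each row is used at most once


rng = random.Random(0)  # seeded so the sketch is reproducible
eight_rows = sample_seed_rows(seed_rows, 8, with_replacement=True, rng=rng)
print(len(eight_rows), "rows drawn from a", len(seed_rows), "row seed")
```

Each drawn seed row then acts as the starting point for one synthetic record, with the sampler, expression, and LLM columns filled in around it.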

In [None]:
# Example 2: Loading seed data from a file
# In practice, you would specify a real file path; this is just for demonstration

# To run this with your own file, remove the surrounding triple quotes
"""
# Load data from a CSV file
file_seed_dd = gretel.data_designer.new(model_suite=model_suite)

# Load data using pandas - works with CSV, Excel, JSON, etc.
file_seed_df = pd.read_csv("your_data.csv")  
# Alternative: file_seed_df = pd.read_excel("your_data.xlsx")
# Alternative: file_seed_df = pd.read_json("your_data.json")

# Add the file-based seed dataset
file_seed_dd.with_seed_dataset(
    file_seed_df,
    sampling_strategy="shuffle",
    with_replacement=True  # Set to True to generate datasets larger than your seed
)
"""

# For our example, we'll continue with our in-memory DataFrame
# Now, let's enhance the seed data with synthetic columns

# Add a discount column based on statistical distribution
seed_dd.add_column(
    C.SamplerColumn(
        name="discount_percent",
        type=P.SamplingSourceType.UNIFORM,
        params=P.UniformSamplerParams(low=0.0, high=0.3),
        convert_to="float"
    )
)

# Add a derived column that calculates the sale price
seed_dd.add_column(
    C.ExpressionColumn(
        name="sale_price",
        expr="{{ (base_price * (1 - discount_percent)) | round(2) }}"
    )
)

# Add an LLM-generated product description referencing the seed data
seed_dd.add_column(
    C.LLMGenColumn(
        name="product_description",
        prompt=(
            "Write a concise 2-3 sentence description for {{ product_name }} made by {{ manufacturer }}. "
            "The product has a rating of {{ rating }} out of 5 stars and costs ${{ base_price }}. "
            "Now on sale for ${{ sale_price }}."
        )
    )
)

In [None]:
# Preview the enhanced seed-based data
# We can generate more records than our original seed data when using with_replacement=True
seed_dd.validate()
seed_preview = seed_dd.preview()  # Preview records sampled from our 4-row seed
seed_preview.dataset.df

## Part 8: Structured Outputs

When generating synthetic data with LLMs, you often need to ensure that the output follows a specific format or structure. This is especially important for applications that expect data to have a certain schema, such as APIs, databases, or data processing pipelines.

Data Designer's structured output feature allows you to define precise schemas that LLM-generated data must conform to. 

Let's explore two ways to define structured outputs: using Pydantic models and using JSON Schema.

In [15]:
# Option 1: Using Pydantic Models for Structured Output
# Pydantic is a data validation library that makes it easy to define data models

from pydantic import BaseModel, Field
from typing import List, Optional

# Define a model for individual review ratings
class RatingBreakdown(BaseModel):
    quality: int = Field(..., description="Rating for product quality (1-5)")
    value: int = Field(..., description="Rating for price-to-value ratio (1-5)")
    durability: int = Field(..., description="Rating for product durability (1-5)")
    appearance: int = Field(..., description="Rating for product appearance (1-5)")
    ease_of_use: int = Field(..., description="Rating for ease of use (1-5)")

# Define a model for product pros and cons
class ProsAndCons(BaseModel):
    pros: List[str] = Field(..., description="List of product pros/positive aspects")
    cons: List[str] = Field(..., description="List of product cons/negative aspects")

# Define a model for a detailed product review
class DetailedReview(BaseModel):
    overall_rating: float = Field(..., description="Overall product rating (1.0-5.0)")
    title: str = Field(..., description="Review title")
    content: str = Field(..., description="Main review content")
    rating_breakdown: RatingBreakdown = Field(..., description="Detailed ratings by category")
    pros_and_cons: ProsAndCons = Field(..., description="Lists of pros and cons")
    verified_purchase: bool = Field(..., description="Whether this is a verified purchase")
    would_recommend: bool = Field(..., description="Whether the reviewer would recommend this product")
    usage_duration: Optional[str] = Field(None, description="How long the reviewer has used the product")

# Create a new Data Designer instance
structured_dd = gretel.data_designer.new(model_suite=model_suite)

In [None]:
# Set up basic product information
structured_dd.add_column(
    name="product_category",
    type="category",
    params={"values": ["Electronics", "Home & Kitchen", "Sports & Outdoors", "Beauty & Personal Care"]}
)

structured_dd.add_column(
    name="product_name",
    type="llm-text",
    prompt="Generate a realistic product name for a {{ product_category }} item."
)

# Now, here's the key part - using our Pydantic model to structure the LLM output
structured_dd.add_column(
    name="detailed_review",
    type="llm-structured",
    prompt=(
        "Write a detailed product review for a {{ product_name }} in the {{ product_category }} category. "
        "The review should be detailed and realistic, covering both positive and negative aspects."
    ),
    # The output_format parameter tells the LLM to generate data conforming to our model
    output_format=DetailedReview
)

In [None]:
# Preview the structured output - note how it follows our Pydantic schema exactly
structured_dd.validate()
preview = structured_dd.preview()
display(preview.dataset.df)

# Display a nicely formatted sample to better see the structured data
preview.display_sample_record()

In [None]:
# Option 2: Using JSON Schema for Structured Output
# JSON Schema is a standard for describing the structure of JSON data

# Define a detailed review schema using JSON Schema that matches our Pydantic model
review_schema = {
    "type": "object",
    "properties": {
        "overall_rating": {
            "type": "number",
            "description": "Overall product rating (1.0-5.0)"
        },
        "title": {
            "type": "string",
            "description": "Review title"
        },
        "content": {
            "type": "string",
            "description": "Main review content"
        },
        "rating_breakdown": {
            "type": "object",
            "properties": {
                "quality": {
                    "type": "integer",
                    "description": "Rating for product quality (1-5)"
                },
                "value": {
                    "type": "integer",
                    "description": "Rating for price-to-value ratio (1-5)"
                },
                "durability": {
                    "type": "integer",
                    "description": "Rating for product durability (1-5)"
                },
                "appearance": {
                    "type": "integer",
                    "description": "Rating for product appearance (1-5)"
                },
                "ease_of_use": {
                    "type": "integer",
                    "description": "Rating for ease of use (1-5)"
                }
            },
            "required": ["quality", "value", "durability", "appearance", "ease_of_use"],
            "description": "Detailed ratings by category"
        },
        "pros_and_cons": {
            "type": "object",
            "properties": {
                "pros": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "List of product pros/positive aspects"
                },
                "cons": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "List of product cons/negative aspects"
                }
            },
            "required": ["pros", "cons"],
            "description": "Lists of pros and cons"
        },
        "verified_purchase": {
            "type": "boolean",
            "description": "Whether this is a verified purchase"
        },
        "would_recommend": {
            "type": "boolean",
            "description": "Whether the reviewer would recommend this product"
        },
        "usage_duration": {
            "type": ["string", "null"],
            "description": "How long the reviewer has used the product"
        }
    },
    "required": [
        "overall_rating", 
        "title", 
        "content", 
        "rating_breakdown", 
        "pros_and_cons", 
        "verified_purchase", 
        "would_recommend"
    ]
}

# Create another Data Designer instance for our JSON Schema example
json_schema_dd = gretel.data_designer.new(model_suite=model_suite)

# Use the same product info setup as the Pydantic example
json_schema_dd.add_column(
    name="product_category",
    type="category",
    params={"values": ["Electronics", "Home & Kitchen", "Sports & Outdoors", "Beauty & Personal Care"]}
)

json_schema_dd.add_column(
    name="product_name",
    type="llm-text",
    prompt="Generate a realistic product name for a {{ product_category }} item."
)

# Add the structured review column using JSON Schema
json_schema_dd.add_column(
    name="detailed_review",
    type="llm-structured",
    prompt=(
        "Write a detailed product review for a {{ product_name }} in the {{ product_category }} category. "
        "The review should be detailed and realistic, covering both positive and negative aspects."
    ),
    # Here output_format receives a JSON Schema dict instead of a Pydantic model
    output_format=review_schema
)

In [None]:
# Preview the JSON Schema structured output
preview = json_schema_dd.preview()
preview.display_sample_record()
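
The `required` lists in a schema like the one above are what make a field mandatory. As a rough illustration of what that means for a single record, the sketch below walks a schema and reports which required fields are absent. `missing_required` is a hypothetical helper that only looks at `required` and nested `object` properties; it is nowhere near a full JSON Schema validator and is not part of Data Designer.

```python
def missing_required(schema, record, path=""):
    """Hypothetical helper: return dotted paths of required properties absent from record."""
    missing = []
    for name in schema.get("required", []):
        if name not in record:
            missing.append(f"{path}{name}")
    # Recurse into nested objects that are present in the record
    for name, subschema in schema.get("properties", {}).items():
        if subschema.get("type") == "object" and isinstance(record.get(name), dict):
            missing.extend(missing_required(subschema, record[name], f"{path}{name}."))
    return missing


schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "rating_breakdown": {
            "type": "object",
            "properties": {"quality": {"type": "integer"}, "value": {"type": "integer"}},
            "required": ["quality", "value"],
        },
    },
    "required": ["title", "rating_breakdown"],
}

record = {"rating_breakdown": {"quality": 4}}
print(missing_required(schema, record))  # ['title', 'rating_breakdown.value']
```

For real validation, a dedicated validator such as the `jsonschema` package or Pydantic's own model validation also checks types, ranges, and array items, which this sketch ignores.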

### Submit a batch job

Let's generate a small dataset of review records and analyze them.

In [None]:


# Generate 100 review records
review_results = json_schema_dd.create(
    num_records=100,
    name="structured_reviews",
    wait_for_completion=True
)

# Access the structured data
reviews_df = review_results.dataset.df

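# Each structured review is stored as one nested object in the `detailed_review` column,
# so parse it (if it arrives as a JSON string) and pull out the review titles and overall ratings
import json

parsed_reviews = reviews_df['detailed_review'].apply(lambda r: json.loads(r) if isinstance(r, str) else r)
pd.DataFrame({
    'title': parsed_reviews.apply(lambda r: r['title']),
    'overall_rating': parsed_reviews.apply(lambda r: r['overall_rating'])
})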

In [None]:
# Extract and analyze data from our structured output
import pandas as pd
import json

# Function to extract data from the structured review JSON
def extract_review_info(row):
    review = json.loads(row['detailed_review']) if isinstance(row['detailed_review'], str) else row['detailed_review']
    
    # Extract the rating breakdown sub-object
    ratings = review['rating_breakdown']
    
    # Create a series with the extracted data
    return pd.Series({
        'overall_rating': review['overall_rating'],
        'title_length': len(review['title']),
        'content_length': len(review['content']),
        'quality_rating': ratings['quality'],
        'value_rating': ratings['value'],
        'durability_rating': ratings['durability'],
        'appearance_rating': ratings['appearance'],
        'ease_of_use_rating': ratings['ease_of_use'],
        'pros_count': len(review['pros_and_cons']['pros']),
        'cons_count': len(review['pros_and_cons']['cons']),
        'verified_purchase': review['verified_purchase'],
        'would_recommend': review['would_recommend'],
        'product_category': row['product_category']
    })

# Apply our extraction function to create a new DataFrame with the extracted data
review_details = reviews_df.apply(extract_review_info, axis=1)

# Now we can easily analyze the data - for example, calculating averages by product category
review_details.groupby('product_category').agg({
    'overall_rating': 'mean',
    'quality_rating': 'mean',
    'value_rating': 'mean',
    'would_recommend': 'mean',  # Proportion that would recommend
    'verified_purchase': 'mean'  # Proportion of verified purchases
}).round(2)

In [None]:
# You can also extract nested fields like the pros and cons
# Let's collect all pros and cons across our reviews

all_points = []
for _, row in reviews_df.iterrows():
    review = json.loads(row['detailed_review']) if isinstance(row['detailed_review'], str) else row['detailed_review']
    product = row['product_name']
    category = row['product_category']
    
    # Extract all pros
    for point in review['pros_and_cons']['pros']:
        all_points.append({
            'product': product,
            'category': category,
            'type': 'pro',
            'point': point
        })
    
    # Extract all cons
    for point in review['pros_and_cons']['cons']:
        all_points.append({
            'product': product,
            'category': category,
            'type': 'con',
            'point': point
        })

# Create a DataFrame of all pros and cons
points_df = pd.DataFrame(all_points)
points_df.head(10)

## Conclusion

Congratulations! You've completed a comprehensive tour of Data Designer's core capabilities. In this tutorial, you've learned how to:

1. Set up a Data Designer project
2. Create various types of sampler columns (categorical, numerical, date/time, UUID)
3. Define relationships between columns using subcategories
4. Use expression columns with Jinja templates to create derived values
5. Generate realistic person data with demographic details
6. Create complex order and transaction data
7. Use LLMs to generate contextual text like descriptions and reviews
8. Add constraints to ensure data validity
9. Generate and save your dataset
10. Work with seed data from your own dataframes or files
11. Create structured outputs using both Pydantic models and JSON Schema

With these skills, you're well-equipped to create sophisticated synthetic datasets for a wide range of applications, from testing and development to privacy protection and AI training.

Check out the other notebooks in this repository for more examples and specialized use cases!