# Getting Started with Data Designer


Welcome! In this guide, we’ll walk through how to use the SDK to generate rich, diverse datasets — from designing your columns to injecting variability and logic into your data.

### 🧙 Magic vs 🛠 Manual Usage

This guide supports two usage styles:

- **Magic Mode** 🪄  
  Enlist the help of an LLM to automatically generate your dataset configuration. Perfect for quick starts and exploring schema ideas.

- **Manual Mode** 🧩  
  Take full control and define your configuration by hand. Ideal when you want precision and complete customization.

---

### 🔁 Generation Methods

Our SDK supports **three flexible generation methods**:

- **Sampling** 🎲  
  Draw values from numerical or categorical distributions to produce realistic, balanced datasets.

- **LLM-generated** 🤖  
  Leverage large language models to create contextual data — such as natural language, code snippets, or structured text.

- **Seeded** 🌱  
  Provide your own seed dataset and sample from it to produce similar or derivative outputs.
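
To make the sampling idea concrete, here is a minimal sketch that uses only Python's standard library rather than the SDK. The column names (`plan`, `bmi`, `discount`), the weights, and the distribution parameters are all made up for illustration; how Data Designer implements its samplers internally is not shown in this guide.

```python
import random

rng = random.Random(7)  # fixed seed so the sketch is repeatable


def sample_row():
    """Build one illustrative row from three independent samplers."""
    return {
        # Categorical sampling: weights are relative frequencies.
        "plan": rng.choices(["basic", "premium"], weights=[0.7, 0.3], k=1)[0],
        # Numerical sampling from a normal distribution (mean 27, std 4).
        "bmi": round(rng.gauss(27.0, 4.0), 1),
        # Numerical sampling from a uniform distribution on [0, 0.5].
        "discount": round(rng.uniform(0.0, 0.5), 2),
    }


rows = [sample_row() for _ in range(1000)]
basic_share = sum(r["plan"] == "basic" for r in rows) / len(rows)
print(f"share of basic plans: {basic_share:.2f}")
```

With enough rows, the share of each category settles near its weight, which is what keeps a sampled dataset balanced in the way you asked for.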

---

### 🧠 Expression-Based Columns with Jinja

For more advanced control, you can define **expression-based columns** using Jinja templating. This unlocks:

- 🔀 Conditional logic and loops (`if`, `else`, `for`)
- 🔗 Cross-column references (`{{ column_name }}`)
- ➕ Basic arithmetic and transformations

This lets you dynamically shape values based on other columns and inject logic directly into your data schema.
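
As a rough illustration of what such expressions evaluate to, the sketch below renders three templates with the `jinja2` package directly, outside the SDK. The row and its field names (`first_name`, `age`, `visits`) are hypothetical, and the guide does not confirm that Data Designer renders templates in exactly this way.

```python
from jinja2 import Template

# A hypothetical row of already-generated column values.
row = {"first_name": "Ada", "age": 36, "visits": 3}

# Cross-column reference
greeting = Template("Patient {{ first_name }}").render(**row)

# Conditional logic
age_group = Template(
    "{% if age >= 18 %}adult{% else %}minor{% endif %}"
).render(**row)

# Basic arithmetic
total_cost = Template("{{ visits * 40 }}").render(**row)

print(greeting, age_group, total_cost)
```

Note that plain Jinja rendering always returns text, so `total_cost` here is the string `"120"`, not an integer. Keep that in mind when a later expression does arithmetic on the result of an earlier one.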

---

Let’s dive in and start building your dataset! 💡


## Installation and Setup

In [None]:
%%capture
# Install the latest version of Gretel client and dependencies
%pip install -U git+https://github.com/gretelai/gretel-python-client networkx datasets

In [1]:
# Import
import time
import pandas as pd

from gretel_client.navigator_client import Gretel
from gretel_client.data_designer import columns as C
from gretel_client.data_designer import params as P


## Create Gretel Client
gretel = Gretel(
    api_key="prompt",
    endpoint="https://api.dev.gretel.ai"
)


model_suite="apache-2.0"
dd = gretel.data_designer.new(model_suite=model_suite) # we will be building on top of this object


Found cached Gretel credentials
Logged in as kirit.thadaka@gretel.ai ✅
Gretel client configured to use project: proj_2uY0cfM0kjiegpyEZvCHNKZYxGf


# 🌱 Beginner: Kickstart with Magic + Seeded Data

In this section, we’re keeping things simple and powerful. You’ll learn how to get up and running quickly by combining:

- **Magic Mode** 🪄: Let an LLM generate your dataset configuration for you, based on a few high-level inputs. (Note: Magic Mode is still experimental, so responses may occasionally be inaccurate.)
- **Seeded Generation** 🌾: Provide a sample dataset, and we’ll draw from it to create new rows with similar structure and variety.

This is a great way to start experimenting with the SDK while getting meaningful output fast. You don’t need to worry about crafting configs by hand just yet — we’ll guide you through using your own data and a touch of AI magic to get impressive results with minimal effort.

📦 By the end of this section, you’ll have a generated dataset built from your seed data and enriched through a config designed by the LLM.

Let’s dive in! 🔍


In [2]:
# Generated using RNG with add_sampling_column function
dd.magic.add_sampling_column(
    name = "patient",
    description = "An American male living in San Diego",
    interactive=False,
    preview=True
)

# Sampling based off of a distribution
dd.magic.add_sampling_column(
    name = "bmi",
    description = "body mass index of an average person",
    interactive=False,
    preview=True
)

[20:15:36] [INFO] 🚀 Generating preview


[20:15:37] [INFO] 🦜 Step 1: Generate columns using samplers


[20:15:40] [INFO] 🎉 Your dataset preview is ready!


[20:15:43] [INFO] 🚀 Generating preview


[20:15:44] [INFO] 🦜 Step 1: Generate columns using samplers


[20:15:46] [INFO] 🎉 Your dataset preview is ready!


In [3]:
# Add seed data: load the seed dataset

from datasets import load_dataset
df_seed = load_dataset("gretelai/symptom_to_diagnosis")["train"].to_pandas()
df_seed = df_seed.rename(columns={"output_text": "diagnosis", "input_text": "patient_summary"})

print(f"Number of records: {len(df_seed)}")



df_seed.head(n=3)

PyTorch version 2.4.1 available.
Number of records: 853


Unnamed: 0,diagnosis,patient_summary
0,cervical spondylosis,I've been having a lot of pain in my neck and ...
1,impetigo,I have a rash on my face that is getting worse...
2,urinary tract infection,I have been urinating blood. I sometimes feel ...


In [4]:
# Add seed columns

dd.with_seed_dataset(
    df_seed,
    sampling_strategy="shuffle",  # TODO: document the other available options
    with_replacement=False,
)

[20:16:18] [INFO] 🌱 Using seed dataset with file ID: file_710cf01186f34b6a8a9e1643fd2fd2e1
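
The `with_replacement` flag follows the usual statistical meaning. The sketch below shows the general difference with Python's standard library; it is an illustration of the concept only, since the notebook does not show how the SDK implements `sampling_strategy="shuffle"`. The `seed_rows` list is a stand-in for the seed DataFrame.

```python
import random

seed_rows = ["impetigo", "dengue", "jaundice", "diabetes"]  # stand-in seed data
rng = random.Random(0)

# Without replacement: one shuffled pass over the seed data,
# so no row repeats within the pass.
without_replacement = rng.sample(seed_rows, k=len(seed_rows))

# With replacement: every draw is independent, so duplicates can appear
# and you can request more rows than the seed contains.
with_replacement = rng.choices(seed_rows, k=6)

print(without_replacement)
print(with_replacement)
```

Sampling without replacement is a reasonable default when you want every seed record represented before any record repeats.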


In [5]:
dd.preview().dataset.df

[20:16:19] [INFO] 🚀 Generating preview
[20:16:20] [INFO] 🎲 Step 1: Sample from dataset
[20:16:22] [INFO] 🦜 Step 2: Generate columns using samplers
[20:16:25] [INFO] 🔗 Step 3: Concat datasets
[20:16:25] [INFO] 🎉 Your dataset preview is ready!


Unnamed: 0,diagnosis,patient_summary,patient,bmi
0,varicose veins,I have these cramps in my calves that are real...,"{'first_name': 'Ian', 'middle_name': '', 'last...",31.20548
1,impetigo,I have been having a fever for a few days and ...,"{'first_name': 'Anthony', 'middle_name': 'Blak...",38.046817
2,bronchial asthma,"I've been having a fever, a cough, and shortne...","{'first_name': 'Jerome', 'middle_name': 'J', '...",27.420705
3,diabetes,I have a feeling of tremors and muscle twitchi...,"{'first_name': 'Clinton', 'middle_name': 'L', ...",33.565807
4,hypertension,"I've been having headaches, chest pain, dizzin...","{'first_name': 'Leonard', 'middle_name': 'Grif...",32.310694
5,urinary tract infection,"I've been feeling really down lately, and my p...","{'first_name': 'Johnny', 'middle_name': '', 'l...",26.119208
6,varicose veins,My legs have been swollen for a few days. I ca...,"{'first_name': 'Brian', 'middle_name': 'Esteba...",28.717618
7,dengue,"I've been feeling really tired and weak, and I...","{'first_name': 'Joseph', 'middle_name': '', 'l...",24.824916
8,jaundice,I have been feeling really sick lately. I have...,"{'first_name': 'Kenneth', 'middle_name': '', '...",40.019826
9,peptic ulcer disease,I've had a change in my bowel movements. They'...,"{'first_name': 'Matthew', 'middle_name': 'Mich...",37.315529


In [6]:
# Add an LLM-generated column. This requires at least one non-LLM-generated column (from either seed data or sampling)

dd.magic.add_column(
    name = "api_response",
    description = "An API response that updates a database with a generated UUID and the diagnosis column",
    must_depend_on=["diagnosis", "patient"],
    interactive=True, #give feedback
    preview=True
)

[20:16:36] [INFO] 🚀 Generating preview


[20:16:37] [INFO] 🎲 Step 1: Sample from dataset


[20:16:38] [INFO] 🦜 Step 2: Generate columns using samplers


[20:16:41] [INFO] 🔗 Step 3: Concat datasets


[20:16:41] [INFO] 🦜 Step 4: Generate column from template


  f"[bold]\[{commands_str}] or your instructions[/bold]", default="retry"


ValidationError: 1 validation error for AIDDMetadata
code_langs.0
  Input should be 'python', 'sqlite', 'tsql', 'bigquery', 'mysql', 'postgres' or 'ansi' [type=enum, input_value='json', input_type=str]
    For further information visit https://errors.pydantic.dev/2.11/v/enum

In [None]:
# # display sample record
# dd.preview().display_sample_record()

# display sample df
dd.preview().dataset.df

# 🧪 Intermediate: Manual Configuration, Sampling, Jinja & LLM Generation

Now that you’ve gotten a feel for the basics, it’s time to roll up our sleeves and explore more of what the SDK can do.

In this section, we’ll move beyond Magic Mode and **manually create a configuration** to define our dataset structure. Here’s what you’ll be working with:

- **📊 Sampling Columns**  
  Learn how to:
  - Sample values from built-in or custom datasets
  - Use statistical distributions (like normal, uniform, categorical) to shape your data

- **🧠 Expression Columns with Jinja**  
  Add powerful logic and dynamic behavior to your data by:
  - Referencing values from other columns (`{{ other_column }}`)
  - Using conditional logic and loops (`if`, `else`, `for`)
  - Performing basic arithmetic (`+`, `-`, `*`, `/`)

- **🤖 LLM-Based Columns**  
  Inject creativity and contextual intelligence by leveraging an LLM to generate column values like text, summaries, or even simple code.

This section will give you the tools to build smart, expressive datasets that go beyond static values — all by hand. It’s a perfect step toward mastering the full flexibility of the SDK.

Ready to get your hands dirty? 🛠 Let’s go!


#### 👩‍🚀 Person Attributes

| Field Name | Type | Default | Alias | Description |
|------------|------|---------|-------|-------------|
| first_name | str | Required | | Person's first name |
| middle_name | str \| None | Required | | Person's middle name (optional) |
| last_name | str | Required | | Person's last name |
| sex | SexT | Required | | Person's sex (enum type) |
| age | int | Required | | Person's age |
| postcode | str | Required | zipcode | Postal/ZIP code |
| street_number | int \| str | Required | | Street number (can be numeric or alphanumeric) |
| street_name | str | Required | | Name of the street |
| unit | str | Required | | Unit/apartment number |
| city | str | Required | | City name |
| region | str \| None | Required | state | Region/state (optional) |
| district | str \| None | Required | county | District/county (optional) |
| country | str | Required | | Country name |
| ethnic_background | str \| None | Required | | Ethnic background (optional) |
| marital_status | str \| None | Required | | Marital status (optional) |
| education_level | str \| None | Required | | Education level (optional) |
| bachelors_field | str \| None | Required | | Field of bachelor's degree (optional) |
| occupation | str \| None | Required | | Occupation (optional) |
| uuid | UUID | Required | | Unique identifier |
| locale | str | "en_US" | | Locale setting |
| phone_number | PhoneNumber \| None | Computed | | Generated phone number based on location (None for age < 18) |
| email_address | EmailStr \| None | Computed | | Generated email address (None for age < 18) |
| birth_date | date | Computed | | Calculated birth date based on age |
| national_id | str \| None | Computed | | National ID (SSN for US locale) |
| ssn | str \| None | Alias to national_id | | Alias for national_id |

In [None]:
# Add in sampled columns
dd.add_column(
    C.SamplerColumn(
        name="emergency_contact",
        type=P.SamplingSourceType.PERSON,
        params=P.PersonSamplerParams(locale='en_US',
                                     sex='Female',
                                     city='San Diego')
    )
)


# Open questions:
# - Do the weights need to sum to 1?
# - What happens if the category values don't match the subcategory keys?
# - What is a scipy distribution?

dd.add_column(
    C.SamplerColumn(
        name="pet_type",
        type=P.SamplingSourceType.CATEGORY,
        params=P.CategorySamplerParams(values=["dog", "cat"],
                                       weights=[0.7, 0.3]),
    )
)

dd.add_column(
    C.SamplerColumn(
        name="first_pet_name",
        type=P.SamplingSourceType.SUBCATEGORY,
        params=P.SubcategoryParams(
            category="pet_type",
            values={
                "dog": ["Buddy", "Max", "Charlie", "Cooper", "Daisy", "Lucy"],
                "cat": ["Oliver", "Leo", "Milo", "Charlie", "Simba", "Luna"],

            }
        )
    )
)

# Sample from a distribution (e.g., Bernoulli, binomial, Poisson)
dd.add_column(
    C.SamplerColumn(
        name="household_income",
        type=P.SamplingSourceType.POISSON,
        params=P.PoissonSamplerParams(mean=100000)
    )
)
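
The category and subcategory samplers above describe a two-step draw: pick the category first, then pick a value from that category's list. Here is a minimal standard-library sketch of that idea. It is not the SDK's implementation, and the shortened name lists are just for illustration.

```python
import random

rng = random.Random(42)

# Subcategory values keyed by the category they belong to.
pet_names = {
    "dog": ["Buddy", "Max", "Charlie"],
    "cat": ["Oliver", "Leo", "Milo"],
}


def sample_pet():
    """Draw a pet type by weight, then a name that matches that type."""
    pet_type = rng.choices(["dog", "cat"], weights=[0.7, 0.3], k=1)[0]
    first_pet_name = rng.choice(pet_names[pet_type])
    return pet_type, first_pet_name


pets = [sample_pet() for _ in range(5)]
for pet_type, name in pets:
    print(pet_type, name)
```

In `random.choices` the weights are relative, so `[7, 3]` behaves the same as `[0.7, 0.3]`. Whether Data Designer normalizes weights the same way is not confirmed by this notebook. In this sketch, a category with no matching key in `pet_names` would raise a `KeyError`; the SDK may handle that case differently.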

In [None]:
# Expression columns
# Jinja syntax reference: https://documentation.bloomreach.com/engagement/docs/jinja-syntax

# Referring to existing columns
dd.add_column(
    C.ExpressionColumn(
        name="patient_full_name",
        expr="{{ patient.first_name }} {{ patient.last_name }}"
    )
)

# Derive a value deterministically with arithmetic
dd.add_column(
    C.ExpressionColumn(
        name="net_worth",
        expr="{{ household_income - 50000 }}"
    )
)


# Conditionally generate values with a Jinja if/else expression
dd.add_column(
    C.ExpressionColumn(
        name="number_of_children",
        expr="{% if household_income > 100000 %}2{% else %}1{% endif %}"
    )
)


In [None]:
dd.preview().dataset.df

In [None]:
# Add an LLM-generated column

dd.add_column(
    C.LLMGenColumn(
        name="potential_cause",
        prompt=(
            # Each fragment ends with a space so the sentences don't run together
            "Write a brief backstory for how {{ patient }} got {{ diagnosis }}. "
            "Ensure it is consistent with {{ patient_summary }}. "
            "Make it no more than 2 sentences."
        ),
    )

)
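
One detail worth watching in multi-line prompts: Python joins adjacent string literals with nothing in between, so each fragment needs its own trailing space or the sentences run together. The short sketch below uses made-up prompt text to show the difference.

```python
# Adjacent string literals inside parentheses are concatenated as-is.
glued = (
    "Write a brief backstory."
    "Make it no more than 2 sentences."
)

# A trailing space on the first fragment keeps the sentences apart.
spaced = (
    "Write a brief backstory. "
    "Make it no more than 2 sentences."
)

print(glued)
print(spaced)
```

The glued version still reaches the LLM, but the missing spaces make the instructions harder to read, both for you and possibly for the model.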

In [None]:
dd.preview().dataset.df

# 🧠 Advanced: Bringing It All Together with Complex LLM Prompts

Welcome to the final section — time to go full wizard mode. 🧙‍♂️

Here, we’ll combine everything you've learned so far into a **powerful, LLM-driven workflow**. Instead of manually configuring each column or relying solely on basic sampling, we’ll focus on crafting **rich, detailed prompts** that instruct the LLM to generate entire datasets with structure, logic, and nuance.

In this section, you’ll:

- 🧩 Bring together **Jinja expressions** and **LLM columns** into a single cohesive config
- ✨ Design and refine complex LLM prompts to guide dataset creation with high fidelity and variability
- 🧠 Leverage the LLM’s contextual understanding to generate multi-column data with realistic relationships and patterns

This approach is ideal when you want to prototype ideas, simulate user behavior, generate synthetic logs, or create diverse, semi-structured content at scale.

By the end of this section, you’ll be able to use the SDK as a creative tool — part data engine, part storytelling assistant.

Let’s take it to the next level. 🚀


In [None]:
# Add an LLM-generated column whose prompt branches on other columns

dd.add_column(
    C.LLMGenColumn(
        name="outcome",
        prompt=(
            # Trailing spaces keep adjacent fragments from running together
            "Write a brief outcome for {{ patient }}. No more than 1 sentence. "
            "{% if bmi > 25 %}"
            "They will have a negative outcome unless they change their lifestyle. "
            "They will need the support of {{ emergency_contact.first_name }}."
            "{% else %}"
            "They must start spending {{ net_worth * 0.5 }} dollars on a treatment plan."
            "{% endif %}"
        ),
    )

)
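
To preview how a branching prompt reads for different rows before sending anything to an LLM, you can render it locally. The sketch below does this with the `jinja2` package on a simplified prompt; the names and values passed to `render` are hypothetical, and the SDK's own rendering step may differ in details.

```python
from jinja2 import Template

prompt = Template(
    "Write a one-sentence outcome for {{ name }}. "
    "{% if bmi > 25 %}"
    "Mention that lifestyle changes are needed."
    "{% else %}"
    "Mention a treatment budget of {{ net_worth * 0.5 }} dollars."
    "{% endif %}"
)

# Render the same template for two hypothetical rows.
high_bmi_prompt = prompt.render(name="Ian", bmi=31.2, net_worth=50000)
low_bmi_prompt = prompt.render(name="Ana", bmi=22.0, net_worth=50000)

print(high_bmi_prompt)
print(low_bmi_prompt)
```

Only one branch survives rendering, so each row's LLM call sees a prompt tailored to its own column values.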

In [None]:
dd.preview().dataset.df