# 🧑‍🤝‍🧑 Navigator Data Designer: Person Sampler Tutorial

Welcome to this tutorial on using the Person Sampler in Gretel's Data Designer! In this notebook, we'll explore how to generate realistic personal information for your synthetic datasets.

## What is the Person Sampler?

The Person Sampler is a powerful feature in Data Designer that generates consistent, realistic person records with attributes like:
- Names (first, middle, last)
- Contact information (email, phone)
- Addresses (street, city, state, zip)
- Demographics (age, sex, ethnic background)
- IDs (SSN, UUID)
- And more!

These records are fully synthetic but maintain the statistical properties and formatting patterns of real personal data.

## Setup and Installation

Let's start by installing the necessary packages and setting up our Gretel client.

In [None]:
%%capture
# Install the latest version of Gretel client and dependencies
%pip install -U git+https://github.com/gretelai/gretel-python-client

In [None]:
# Import necessary libraries
import pandas as pd

from gretel_client.navigator_client import Gretel
from gretel_client.data_designer import columns as C
from gretel_client.data_designer import params as P

# Create Gretel Client
gretel = Gretel(
    api_key="prompt",  # This will prompt for your API key
    endpoint="https://api.dev.gretel.ai"
)

# Create a new Data Designer object
model_suite = "apache-2.0"
dd = gretel.data_designer.new(model_suite=model_suite)

## 1. Basic Person Sampling

Let's start with a simple example: a single person column that sets only the locale and sex and leaves every other parameter at its default.

In [None]:
# Add a person column that sets only locale and sex; other parameters keep their defaults
dd.add_column(
    C.SamplerColumn(
        name="person",  # This creates a nested object with all person attributes
        type=P.SamplingSourceType.PERSON,
        params=P.PersonSamplerParams(locale="en_US", sex="Male")
    )
)

# Preview what the generated data looks like
preview = dd.preview()
preview.dataset.df

## 2. Accessing Individual Person Attributes

The `person` column we created above is a nested object with many attributes. Let's create some columns to access specific attributes from this person object.

In [None]:
# Add columns to extract specific attributes from the person object
dd.add_column(
    C.ExpressionColumn(
        name="full_name",
        expr="{{ person.first_name }} {{ person.last_name }}"
    )
)

dd.add_column(
    C.ExpressionColumn(
        name="email",
        expr="{{ person.email_address }}"
    )
)

dd.add_column(
    C.ExpressionColumn(
        name="address",
        expr="{{ person.street_number }} {{ person.street_name }}, {{ person.city }}, {{ person.state }} {{ person.zipcode }}"
    )
)

dd.add_column(
    C.ExpressionColumn(
        name="age",
        expr="{{ person.age }}"
    )
)

# Preview the results
preview = dd.preview()
preview.dataset.df[['full_name', 'email', 'address', 'age']]
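
To make the `{{ person.first_name }}` syntax concrete, here is a minimal sketch of how a dotted placeholder can be resolved against a nested record. It is a stand-in written with the standard library, not Data Designer's own template engine; the `render` helper and the sample record are hypothetical.

```python
import re

# Matches placeholders such as {{ person.first_name }}.
PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def render(template, record):
    """Replace each {{ dotted.path }} with the value found in `record`."""

    def lookup(match):
        value = record
        for key in match.group(1).split("."):
            value = value[key]  # raises KeyError on an unknown attribute
        return str(value)

    return PLACEHOLDER.sub(lookup, template)


# Hypothetical record shaped like one row of the `person` column.
row = {"person": {"first_name": "John", "last_name": "Smith", "age": 42}}

full_name = render("{{ person.first_name }} {{ person.last_name }}", row)
print(full_name)
```

In this sketch every rendered value becomes a string, so a numeric attribute such as `age` would need converting back if you want to do arithmetic on it.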

## 3. Customizing Person Samplers

Now let's explore customizing the Person Sampler to generate specific types of profiles.

In [None]:
# Reset our Data Designer object
dd = gretel.data_designer.new(model_suite=model_suite)

# Create custom person samplers for different roles/demographics
dd.add_column(
    C.SamplerColumn(
        name="employee",
        type=P.SamplingSourceType.PERSON,
        params=P.PersonSamplerParams(
            locale="en_US",
            age_min=22,
            age_max=65,
            city="San Francisco",
            state="CA"
        )
    )
)

dd.add_column(
    C.SamplerColumn(
        name="customer",
        type=P.SamplingSourceType.PERSON,
        params=P.PersonSamplerParams(
            locale="en_US",  # US locale
            age_min=18,
            age_max=80
            # No city/state restrictions
        )
    )
)

# Create a UK-based person
dd.add_column(
    C.SamplerColumn(
        name="uk_contact",
        type=P.SamplingSourceType.PERSON,
        params=P.PersonSamplerParams(
            locale="en_GB",  # UK locale
            city="London"
        )
    )
)

# Add columns to extract and format information
dd.add_column(
    C.ExpressionColumn(
        name="employee_info",
        expr="{{ employee.first_name }} {{ employee.last_name }}, {{ employee.age }} - {{ employee.city }}, {{ employee.state }}"
    )
)

dd.add_column(
    C.ExpressionColumn(
        name="customer_info",
        expr="{{ customer.first_name }} {{ customer.last_name }}, {{ customer.age }} - {{ customer.city }}, {{ customer.state }}"
    )
)

dd.add_column(
    C.ExpressionColumn(
        name="uk_contact_info",
        expr="{{ uk_contact.first_name }} {{ uk_contact.last_name }}, {{ uk_contact.phone_number }} - {{ uk_contact.city }}"
    )
)

# Preview the results
preview = dd.preview()
preview.dataset.df[['employee_info', 'customer_info', 'uk_contact_info']]
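
A quick way to confirm that constraints such as `age_min`, `age_max`, and `city` took effect is to check the sampled records against them. The sketch below does this on plain dictionaries. The `violations` helper and the sample rows are hypothetical, and it assumes the age bounds are inclusive, which the tutorial does not state.

```python
def violations(records, age_min=None, age_max=None, city=None):
    """Return the records that break any of the given constraints."""
    bad = []
    for rec in records:
        too_young = age_min is not None and rec["age"] < age_min
        too_old = age_max is not None and rec["age"] > age_max
        wrong_city = city is not None and rec["city"] != city
        if too_young or too_old or wrong_city:
            bad.append(rec)
    return bad


# Hypothetical rows shaped like the `employee` column configured above.
employees = [
    {"first_name": "Ana", "age": 34, "city": "San Francisco"},
    {"first_name": "Ben", "age": 19, "city": "San Francisco"},  # below age_min
    {"first_name": "Cal", "age": 50, "city": "Oakland"},  # wrong city
]

bad_rows = violations(employees, age_min=22, age_max=65, city="San Francisco")
print([rec["first_name"] for rec in bad_rows])
```

An empty result means every record satisfied the constraints that were passed in.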

## 4. Available Person Attributes

The Person Sampler generates a rich set of attributes that you can use. Here's a reference list of some of the key attributes available:

| Attribute | Description | Example |
|-----------|-------------|--------|
| `first_name` | Person's first name | "John" |
| `middle_name` | Person's middle name (may be None) | "Robert" |
| `last_name` | Person's last name | "Smith" |
| `sex` | Person's sex | "Male" |
| `age` | Person's age in years | 42 |
| `birth_date` | Date of birth | "1980-05-15" |
| `email_address` | Email address | "john.smith@example.com" |
| `phone_number` | Phone number | "+1 (555) 123-4567" |
| `street_number` | Street number | "123" |
| `street_name` | Street name | "Main Street" |
| `unit` | Apartment/unit number | "Apt 4B" |
| `city` | City name | "Chicago" |
| `state` | State/province (locale dependent) | "IL" |
| `county` | County (locale dependent) | "Cook" |
| `zipcode` | Postal/ZIP code | "60601" |
| `country` | Country name | "United States" |
| `ssn` | Social Security Number (US locale) | "123-45-6789" |
| `occupation` | Occupation | "Software Engineer" |
| `marital_status` | Marital status | "Married" |
| `education_level` | Education level | "Bachelor's Degree" |
| `ethnic_background` | Ethnic background | "Caucasian" |
| `uuid` | Unique identifier | "550e8400-e29b-41d4-a716-446655440000" |
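
For orientation, a single sampled person can be pictured as a nested mapping keyed by the attribute names above. The sketch below reuses a few example values from the table. The exact set of keys and value types returned by the service may differ, so treat the shape as an assumption.

```python
# Illustrative only: one person record built from example values in the table.
person = {
    "first_name": "John",
    "middle_name": "Robert",  # may be None for some records
    "last_name": "Smith",
    "sex": "Male",
    "age": 42,
    "email_address": "john.smith@example.com",
    "city": "Chicago",
    "state": "IL",
    "zipcode": "60601",
}

# Skip the optional middle name when it is missing.
name_parts = [person["first_name"], person.get("middle_name"), person["last_name"]]
display_name = " ".join(part for part in name_parts if part)
print(display_name)
```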

## 5. Creating Multiple Person Samplers with One Method

For convenience, Data Designer provides a `with_person_samplers` method to create multiple person samplers at once.

In [None]:
# Reset our Data Designer object
dd = gretel.data_designer.new(model_suite=model_suite)

# Create multiple person samplers at once
dd.with_person_samplers({
    "doctor": {"locale": "en_US", "age_min": 30, "age_max": 70},
    "patient": {"locale": "en_US", "age_min": 18, "age_max": 90},
    "nurse": {"locale": "en_US", "age_min": 25, "age_max": 65, "sex": "Female"},
    "international_doctor": {"locale": "fr_FR", "age_min": 35, "age_max": 65}
})

# Add columns to format information for each person type
dd.add_column(
    C.ExpressionColumn(
        name="doctor_profile",
        expr="Dr. {{ doctor.first_name }} {{ doctor.last_name }}, {{ doctor.age }}, {{ doctor.email_address }}"
    )
)

dd.add_column(
    C.ExpressionColumn(
        name="patient_profile",
        expr="{{ patient.first_name }} {{ patient.last_name }}, {{ patient.age }}, {{ patient.city }}, {{ patient.state }}"
    )
)

dd.add_column(
    C.ExpressionColumn(
        name="nurse_profile",
        expr="Nurse {{ nurse.first_name }} {{ nurse.last_name }}, {{ nurse.age }}"
    )
)

dd.add_column(
    C.ExpressionColumn(
        name="international_doctor_profile",
        expr="Dr. {{ international_doctor.first_name }} {{ international_doctor.last_name }}, {{ international_doctor.city }}, {{ international_doctor.country }}"
    )
)

# Preview the results
preview = dd.preview()
preview.dataset.df[['doctor_profile', 'patient_profile', 'nurse_profile', 'international_doctor_profile']]
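
When many roles share most settings, the dictionary passed to `with_person_samplers` can be assembled programmatically. The sketch below is plain Python; the `build_sampler_configs` helper is hypothetical, and it only uses the keys already shown above (`locale`, `age_min`, `age_max`, `sex`).

```python
def build_sampler_configs(defaults, overrides):
    """Merge shared defaults with per-role overrides into one config per role."""
    return {role: {**defaults, **extra} for role, extra in overrides.items()}


shared = {"locale": "en_US"}
roles = {
    "doctor": {"age_min": 30, "age_max": 70},
    "patient": {"age_min": 18, "age_max": 90},
    "nurse": {"age_min": 25, "age_max": 65, "sex": "Female"},
    # A role can override a shared default such as the locale.
    "international_doctor": {"locale": "fr_FR", "age_min": 35, "age_max": 65},
}

configs = build_sampler_configs(shared, roles)
print(configs["nurse"])

# In the notebook, the result would be passed on unchanged:
# dd.with_person_samplers(configs)
```

Keeping the shared settings in one place means a change such as a different default locale only has to be made once.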

## 6. Using Person Data with LLM Generation

One of the most powerful features of Data Designer is combining structured person data with LLM generation to create realistic, contextual content.

In [None]:
# Reset our Data Designer object
dd = gretel.data_designer.new(model_suite=model_suite)

# Create person samplers for patients and doctors
dd.with_person_samplers({
    "patient": {"locale": "en_US", "age_min": 18, "age_max": 85},
    "doctor": {"locale": "en_US", "age_min": 30, "age_max": 70}
})

# Add some medical condition sampling
dd.add_column(
    C.SamplerColumn(
        name="medical_condition",
        type=P.SamplingSourceType.CATEGORY,
        params=P.CategorySamplerParams(
            values=[
                "Hypertension", 
                "Type 2 Diabetes", 
                "Asthma", 
                "Rheumatoid Arthritis", 
                "Migraine", 
                "Hypothyroidism"
            ]
        )
    )
)

# Add basic info columns
dd.add_column(
    C.ExpressionColumn(
        name="patient_name",
        expr="{{ patient.first_name }} {{ patient.last_name }}"
    )
)

dd.add_column(
    C.ExpressionColumn(
        name="doctor_name",
        expr="Dr. {{ doctor.first_name }} {{ doctor.last_name }}"
    )
)

# Add an LLM-generated medical note
dd.add_column(
    C.LLMGenColumn(
        name="medical_notes",
        prompt=(
            "Write a brief medical note from {{ doctor_name }} about patient {{ patient_name }}, "
            "a {{ patient.age }}-year-old {{ patient.sex }} with {{ medical_condition }}. "
            "Include relevant medical observations and recommendations. "
            "The patient lives in {{ patient.city }}, {{ patient.state }} and works as {{ patient.occupation }}. "
            "Keep the note professional, concise (3-4 sentences), and medically accurate."
        )
    )
)

# Add an LLM-generated patient message
dd.add_column(
    C.LLMGenColumn(
        name="patient_message",
        prompt=(
            "Write a brief message (1-2 sentences) from {{ patient_name }} to {{ doctor_name }} "
            "about their {{ medical_condition }}. The message should reflect the patient's "
            "experience and concerns. The patient is {{ patient.age }} years old."
        )
    )
)

# Preview the results
preview = dd.preview()
preview.dataset.df[['patient_name', 'doctor_name', 'medical_condition', 'medical_notes', 'patient_message']]
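
Because prompts reference other columns by name, a typo in a placeholder is easy to miss until generation time. The sketch below lists placeholders whose root name is not a known column. It is a standalone check written with the standard library, not a Data Designer feature, and `missing_references` is a hypothetical helper.

```python
import re

# Matches placeholders such as {{ patient.age }} or {{ doctor_name }}.
PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def missing_references(prompt, known_columns):
    """Return placeholder roots in `prompt` that are not in `known_columns`."""
    roots = {path.split(".")[0] for path in PLACEHOLDER.findall(prompt)}
    return sorted(roots - set(known_columns))


columns = ["patient", "doctor", "medical_condition", "patient_name", "doctor_name"]

prompt = (
    "Write a brief message from {{ patient_name }} to {{ doctor_name }} "
    "about their {{ medical_condtion }}. "  # deliberate typo
    "The patient is {{ patient.age }} years old."
)

print(missing_references(prompt, columns))
```

Only the part before the first dot is checked, so the sketch catches a misspelled column name but not a misspelled attribute such as `patient.agee`.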

## 7. Generating and Saving the Final Dataset

Now that we've explored the Person Sampler capabilities, let's generate a complete dataset from the configuration built in the previous section and save it as a CSV file.

In [None]:
# Generate a final dataset
workflow_name = "synthetic-person-data"

# Submit the job to generate 100 records
workflow_run = dd.create(
    num_records=100,
    workflow_run_name=workflow_name,
    wait_for_completion=True
)

print(f"Generated dataset with {len(workflow_run.dataset.df)} records")

# Save the dataset to CSV
csv_filename = f"{workflow_name}.csv"
workflow_run.dataset.df.to_csv(csv_filename, index=False)
print(f"Dataset saved to {csv_filename}")

# Show a sample of the final dataset
workflow_run.dataset.df.head()
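
After saving, it can be worth reloading the file and confirming the row count and that no field came back empty. The sketch below uses only the standard library and writes a small stand-in file to a temporary directory; the `summarize_csv` helper and the sample rows are hypothetical.

```python
import csv
import os
import tempfile


def summarize_csv(path):
    """Return (row_count, columns_with_empty_values) for a CSV file."""
    with open(path, newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))
    empty = sorted({key for row in rows for key, value in row.items() if value == ""})
    return len(rows), empty


# Write a tiny stand-in for the generated dataset.
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "synthetic-person-data.csv")
with open(csv_path, "w", newline="", encoding="utf-8") as handle:
    writer = csv.DictWriter(handle, fieldnames=["patient_name", "medical_notes"])
    writer.writeheader()
    writer.writerow({"patient_name": "John Smith", "medical_notes": "Stable."})
    writer.writerow({"patient_name": "Jane Doe", "medical_notes": ""})

row_count, empty_columns = summarize_csv(csv_path)
print(row_count, empty_columns)
```

Pointing `summarize_csv` at the file written above would give the same two numbers for the real dataset.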

## Conclusion

In this tutorial, we've explored the Person Sampler functionality in Data Designer. We've learned how to:

1. Generate basic person records with realistic attributes
2. Customize person profiles by locale, age range, sex, and location
3. Create multiple person samplers for different roles or demographics
4. Use person attributes in expressions and LLM prompts

The Person Sampler is an essential tool for creating realistic synthetic datasets for testing, development, and training applications that handle personal information.

For more advanced Data Designer features, check out the other notebooks in the getting-started folder!