<a target="_parent" href="https://colab.research.google.com/github/gretelai/gretel-blueprints/blob/main/docs/notebooks/demo/navigator/getting-started/person-sampler-tutorial.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# 🧾 Navigator Data Designer: W-2 Dataset Generator

This notebook combines numerical samplers, person samplers, and LLMs to create a synthetic dataset of W-2 forms (US Wage and Tax Statements).

### Generating realistic numerical values

We will generate numerical fields using statistics published by the IRS for the most recent available year, 2021:

- https://www.irs.gov/pub/irs-pdf/p5385.pdf
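
To make the approach concrete, the sketch below derives the two parameters of a Bernoulli-exponential mixture from the aggregate Box 1 figures quoted later in this notebook (the share of forms with a non-zero value, and total wages divided by the number of non-zero forms), then draws a few values using only Python's standard library. The helper `sample_mixture` is a hypothetical stand-in for illustration; it is an assumption about how such a mixture behaves, not Data Designer's implementation of the `bernoulli_mixture` sampler.

```python
import random

# Aggregate Box 1 figures from the IRS statistics linked above, as quoted in this notebook.
TOTAL_FORMS = 277_981_454
NONZERO_FORMS = 276_388_660
TOTAL_WAGES = 9_920_000_000 * 1000  # the IRS table reports thousands of dollars

p_nonzero = NONZERO_FORMS / TOTAL_FORMS     # probability that Box 1 is non-zero
mean_nonzero = TOTAL_WAGES / NONZERO_FORMS  # mean of the non-zero values


def sample_mixture(p, scale, rng):
    """Return 0 with probability 1 - p, otherwise an exponential draw with mean `scale`."""
    if rng.random() >= p:
        return 0
    return int(rng.expovariate(1.0 / scale))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [sample_mixture(p_nonzero, mean_nonzero, rng) for _ in range(5)]
print(f"p = {p_nonzero:.3f}, scale = {mean_nonzero:,.2f}")
print(samples)
```

The same two-step derivation (non-zero share, then mean of the non-zero values) applies to any box for which the IRS reports both a count and a total.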

### Generating realistic taxpayers

We will use the person sampler to generate realistic US taxpayers. When the US locale is chosen, statistics for generated persons reflect real-world US census data (note that when other locales are chosen, statistics are not representative of a realistic population).

## Setup and Installation

Let's start by installing the necessary packages and setting up our Gretel client.

In [1]:
%%capture
# Install the latest version of Gretel client and dependencies
%pip install -U gretel_client 

In [2]:
# Import necessary libraries
import pandas as pd

from gretel_client.navigator_client import Gretel
from gretel_client.data_designer.columns import SamplerColumn, ExpressionColumn, LLMTextColumn

# Create Gretel Client
gretel = Gretel(
    api_key="prompt",  # This will prompt for your API key
    endpoint="https://api.dev.gretel.ai"
)

model_suite = "apache-2.0"
# Create a new Data Designer object
dd = gretel.data_designer.new(model_suite=model_suite)

Found cached Gretel credentials
Logged in as dane.corneil+sandbox@gretellabs.com ✅
Using project: default-sdk-project-6e0e099a2b1a533
Project link: https://console-eng.gretel.ai/proj_2umSxvO7MU8ZVRGprRKPTKhVkKy


## Setting up taxpayer sampling

In [3]:
# Create a person sampler for an American taxpayer, and an employer sampler for generating the employer address.
dd.with_person_samplers({
    "taxpayer": {"locale": "en_US"},
    "employer": {"locale": "en_US"},
})

## Defining the fields

We will focus on the following:
- Box 1 (Wages, tips, and other compensation)
- Box 2 (Federal income tax withheld)
- Box 3 (Social security wages)
- Box 4 (Social security tax withheld)
- Box 5 (Medicare wages and tips)
- Box 6 (Medicare tax withheld)
- Box 7 (Social security tips)
- Box a (Employee's social security number)
- Box c (Employer's name, address and zip code)
- Box e (Employee's first name, initial, and last name)
- Box f (Employee's address and zip code)

In [4]:
### BOX 1 (TOTAL WAGES, TIPS, AND OTHER COMPENSATION) ###

# From page 6 of the IRS statistics, 276,388,660 of 277,981,454 W-2 forms (99.4%) had a non-zero value for Box 1.
# From page 8 of the IRS statistics, the sum of this field across all forms was 9,920,000,000 * $1,000 = $9,920,000,000,000.
# With 276,388,660 non-zero Box 1 values, the average non-zero value of Box 1 was $9,920,000,000,000 / 276,388,660 = $35,891.49.
# We will use a Bernoulli-Exponential mixture distribution to sample values for this field.
dd.add_column(
    SamplerColumn(
        name="box_1_wages_tips_other_compensation", 
        type="bernoulli_mixture",
        params={
            "p": 0.994,
            "dist_name": "expon",
            "dist_params": {"scale": 35891.49}
        },
        convert_to="int",
    )
)

### BOX 2 (FEDERAL INCOME TAX WITHHELD) ###

# Note: The calculations below are a simplification based on the assumption that this is an individual's only W-2.
# In practice, the taxable income is based on all wages for individuals with multiple W-2s.

# 2022 standard deduction ($25,900 for married filing jointly, $12,950 for single filers)
dd.add_column(
    ExpressionColumn(
        name="standard_deduction",
        expr="{% if taxpayer.marital_status == 'married_present' %}25900{% else %}12950{% endif %}",
        convert_to="float",
    ),
)

dd.add_column(
    ExpressionColumn(
        name="taxable_income",
        expr="{{ [0, box_1_wages_tips_other_compensation - standard_deduction]|max }}",
        convert_to="float",
    )
)

# We'll sum the tax incurred in each 2022 tax bracket.
# For simplicity, we apply the single-filer brackets to every taxpayer, regardless of marital status.
BRACKETS = [
    {"name": "bracket1", "rate": 0.10, "max": 10275, "min": 0},
    {"name": "bracket2", "rate": 0.12, "max": 41775, "min": 10275},
    {"name": "bracket3", "rate": 0.22, "max": 89075, "min": 41775},
    {"name": "bracket4", "rate": 0.24, "max": 170050, "min": 89075},
    {"name": "bracket5", "rate": 0.32, "max": 215950, "min": 170050},
    {"name": "bracket6", "rate": 0.35, "max": 539900, "min": 215950},
    {"name": "bracket7", "rate": 0.37, "max": 10000000000000, "min": 539900},
]
for bracket in BRACKETS:
    expression = f"{bracket['rate']}*([[taxable_income,{bracket['max']}]|min - {bracket['min']}, 0] | max)"
    dd.add_column(
        ExpressionColumn(
            name=bracket["name"],
            expr="{{ " + expression + " }}",
            convert_to="float",
        )
    )

# Sum the per-bracket amounts to get the expected tax liability (the mean amount withheld)
dd.add_column(
    ExpressionColumn(
        name="mean_tax_liability",
        expr="{{ bracket1 + bracket2 + bracket3 + bracket4 + bracket5 + bracket6 + bracket7 }}",
        convert_to="int",
    )
)

# Add some noise to get the actual withholding
dd.add_column(
    SamplerColumn(  
        name="tax_liability_noise",
        type="gaussian",
        params=dict(mean=1, stddev=0.1),
    )
)
dd.add_column(
    ExpressionColumn(
        name="box_2_federal_income_tax_withheld",
        expr="{{ (mean_tax_liability * tax_liability_noise) | int }}",
    )
)

### BOX 3 (SOCIAL SECURITY WAGES) ###

# From Page 8 of the IRS Statistics, we know that social security wages are, on average, 8,150,000,000/9,920,000,000 ~= 82.16% of total wages.
# We'll sample a ratio from a normal distribution with mean 0.8216 and standard deviation 0.2.
dd.add_column(
    SamplerColumn(
        name="social_security_wages_ratio",
        type="gaussian",
        params=dict(mean=0.8216, stddev=0.2),
        convert_to="float",
    )
)

dd.add_column(
    ExpressionColumn(
        name="box_3_social_security_wages",
        expr="{{ (box_1_wages_tips_other_compensation * social_security_wages_ratio) | int }}",
    )
)

### BOX 4 (SOCIAL SECURITY TAX WITHHELD) ###

# In 2022, social security tax was withheld at a rate of 6.2% on social security wages up to the wage base of $147,000.
dd.add_column(
    ExpressionColumn(
        name="box_4_social_security_tax_withheld",
        expr="{{ (([box_3_social_security_wages, 147000]|min) * 0.062) | int }}",
    )
)

### BOX 5 (MEDICARE WAGES AND TIPS) ###

# From Page 8 of the IRS Statistics, we know that Medicare wages and tips are, on average, 10,300,000,000/9,920,000,000 ~= 103.8% of total wages.
dd.add_column(
    SamplerColumn(
        name="medicare_wages_and_tips_ratio",
        type="gaussian",
        params=dict(mean=1.038, stddev=0.2),
        convert_to="float",
    )
)

dd.add_column(
    ExpressionColumn(
        name="box_5_medicare_wages_and_tips",
        expr="{{ (box_1_wages_tips_other_compensation * medicare_wages_and_tips_ratio) | int }}",
    )
)

### BOX 6 (MEDICARE TAX WITHHELD) ###

# The standard employee Medicare tax rate in 2022 was 1.45% on all Medicare wages.
# The Additional Medicare Tax rate in 2022 was 0.9% on all Medicare wages in excess of $200,000.
dd.add_column(
    ExpressionColumn(
        name="box_6_medicare_tax_withheld",
        expr="{{ ((box_5_medicare_wages_and_tips * 0.0145) + (([box_5_medicare_wages_and_tips - 200000, 0]|max) * 0.009)) | int }}",
    )
)

### BOX 7 (SOCIAL SECURITY TIPS) ###

# From page 6 of the IRS statistics, only 12,620,946 of 277,981,454 W-2 forms (4.54%) had a non-zero value for Box 7.
# From page 8 of the IRS statistics, the sum of this field across all forms was 55,897,014 * $1,000 = $55,897,014,000.
# With 12,620,946 non-zero Box 7 values, the average non-zero value of Box 7 was $55,897,014,000 / 12,620,946 = $4,428.91.
# We will use a Bernoulli-Exponential mixture distribution to sample values for this field.
dd.add_column(
    SamplerColumn(
        name="box_7_social_security_tips",
        type="bernoulli_mixture",
        params={
            "p": 0.0454,
            "dist_name": "expon",
            "dist_params": {"scale": 4428.91}
        },
        convert_to="int",
    )
)

### BOX A (EMPLOYEE'S SOCIAL SECURITY NUMBER) ###

# We can use the ssn field of the person sampler to generate a valid SSN for the employee.

dd.add_column(
    ExpressionColumn(
        name="box_a_employee_ssn",
        expr="{{ taxpayer.ssn }}",
    )
)

### BOX C (EMPLOYER'S NAME, ADDRESS AND ZIP CODE) ###

# We want to generate a realistic company name.
# We'll start by generating a list of industries, expanded with magic.
dd.add_column(
    SamplerColumn(
        name="employer_business",
        type="category",
        params={
            "values": [
                "software",
                "health insurance",
                "shoe store",
                "restaurant",
                "plumbing",
            ]
        }
    )
).magic.extend_category("employer_business", n=25)

# Next, we'll generate an actual name based on the type of business.
dd.add_column(
    LLMTextColumn(
        name="employer_name",
        prompt="Generate an original name for a {{ employer_business }} business in {{ employer.city }}.",
    )
)

# Finally, we'll combine the employer name with the address of the employer.
dd.add_column(
    ExpressionColumn(
        name="box_c_employer_name_address_zip",
        expr="{{ employer_name }}\n{{ employer.street_number }} {{ employer.street_name }}\n{{ employer.city }}, {{ employer.state }} {{ employer.postcode }}",
    )
)

### BOX E (EMPLOYEE'S FIRST NAME, INITIAL, AND LAST NAME) ###

# We can extract the first name, initial, and last name from the person sampler.

dd.add_column(
    ExpressionColumn(
        name="box_e_employee_first_name_initial_last_name",
        expr="{{ taxpayer.first_name }} {{ taxpayer.middle_name[:1] }} {{ taxpayer.last_name }}",
    )
)

### BOX F (EMPLOYEE'S ADDRESS AND ZIP CODE) ###

# Similarly, we can extract the employee's address and zip code from the person sampler.

dd.add_column(
    ExpressionColumn(
        name="box_f_employee_address_zip",
        expr="{{ taxpayer.street_number }} {{ taxpayer.street_name }}\n{{ taxpayer.city }}, {{ taxpayer.state }} {{ taxpayer.postcode }}",
    )
)

# These are the columns we want in the final dataset, after dropping latent variables.
FINAL_COLUMNS = [
    "box_1_wages_tips_other_compensation",
    "box_2_federal_income_tax_withheld",
    "box_3_social_security_wages",
    "box_4_social_security_tax_withheld",
    "box_5_medicare_wages_and_tips",
    "box_6_medicare_tax_withheld",  
    "box_7_social_security_tips",
    "box_a_employee_ssn",
    "box_c_employer_name_address_zip",
    "box_e_employee_first_name_initial_last_name",
    "box_f_employee_address_zip",
]
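
The bracket columns defined in this cell compute, for each bracket, the rate times the portion of taxable income that falls inside it. A plain-Python equivalent can be handy for spot-checking generated rows. This is a sketch under the same simplifying assumptions as the notebook (2022 single-filer brackets, one W-2 per person, no noise term); `income_tax` and `BRACKETS_2022_SINGLE` are hypothetical names for illustration and are not part of `gretel_client`.

```python
# 2022 single-filer brackets as (lower bound, upper bound, rate),
# mirroring the BRACKETS list used for the expression columns.
BRACKETS_2022_SINGLE = [
    (0, 10_275, 0.10),
    (10_275, 41_775, 0.12),
    (41_775, 89_075, 0.22),
    (89_075, 170_050, 0.24),
    (170_050, 215_950, 0.32),
    (215_950, 539_900, 0.35),
    (539_900, float("inf"), 0.37),
]


def income_tax(wages, standard_deduction=12_950):
    """Sum rate * (income falling inside each bracket), truncated to whole dollars."""
    taxable = max(0, wages - standard_deduction)
    total = 0.0
    for lower, upper, rate in BRACKETS_2022_SINGLE:
        # Portion of taxable income inside this bracket; zero if income is below it.
        total += rate * max(min(taxable, upper) - lower, 0)
    return int(total)


print(income_tax(10_000))  # wages below the standard deduction
print(income_tax(50_000))
```

Because Box 2 also includes a multiplicative Gaussian noise factor, generated values should scatter around this figure rather than match it exactly.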



In [5]:
# Preview the results
preview = dd.preview()
preview.dataset.df[FINAL_COLUMNS]

[23:25:28] [INFO] 🚀 Generating preview
[23:25:37] [INFO] 🎲 Step 1: Using samplers to generate 8 columns
[23:25:41] [INFO] 🦜 Step 2: Generating text column `employer_name`
[23:25:42] [INFO] 💬 Step 3: Rendering expression column `standard_deduction`
[23:25:43] [INFO] 💬 Step 4: Rendering expression column `box_3_social_security_wages`
[23:25:43] [INFO] 💬 Step 5: Rendering expression column `box_5_medicare_wages_and_tips`
[23:25:43] [INFO] 💬 Step 6: Rendering expression column `box_a_employee_ssn`
[23:25:44] [INFO] 💬 Step 7: Rendering expression column `box_e_employee_first_name_initial_last_name`
[23:25:44] [INFO] 💬 Step 8: Rendering expression column `box_f_employee_address_zip`
[23:25:44] [INFO] 💬 Step 9: Rendering expression column `box_c_employer_name_address_zip`
[23:25:45] [INFO] 💬 Step 10: Rendering expression column `taxable_income`
[23:25:45] [INFO] 💬 Step 11: Rendering expression column `box_4_social_security_tax_withheld`
[23:25:46] [INFO] 💬 Step 12: Rendering expression column

| | box_1_wages_tips_other_compensation | box_2_federal_income_tax_withheld | box_3_social_security_wages | box_4_social_security_tax_withheld | box_5_medicare_wages_and_tips | box_6_medicare_tax_withheld | box_7_social_security_tips | box_a_employee_ssn | box_c_employer_name_address_zip | box_e_employee_first_name_initial_last_name | box_f_employee_address_zip |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 21638 | 0 | 24086 | 1493 | 22731 | 329 | 0 | 557-81-3959 | Village Forge Hardware\n156 Templar Road\nMidd... | Sharon A Fuller | 140 East Minnesota Avenue\nCaliente, CA |
| 1 | 5555 | 0 | 4838 | 299 | 4515 | 65 | 0 | 030-86-2552 | The Inked Parchment\n14 Mount Marcy Drive\nAlb... | Mary K Humes | 2 Nichols Drive\nQuincy, MA |
| 2 | 887 | 0 | 854 | 52 | 742 | 10 | 0 | 507-99-5038 | AquaFlow Solutions\n502 Bigney Rd\nJacksonvill... | Charles J Cloer | 257 Creekside Circle\nLincoln, NE |
| 3 | 89196 | 9230 | 52091 | 3229 | 109420 | 1586 | 0 | 318-38-7854 | Anderson Health Pharmacy\n419 Patricia Dr\nAnd... | Robert R Rivers | 565 W Main St\nCharleston, IL |
| 4 | 151101 | 28322 | 136716 | 8476 | 134393 | 1948 | 0 | 128-42-9157 | Beacheside Health Center\n14 State Road 524\nJ... | Thomas W Attal | 17 Nys Route 22\nAndover, NY |
| 5 | 7661 | 0 | 3983 | 246 | 7496 | 108 | 0 | 455-36-7444 | Tiny Timber Daycare\n973 SE Franklin St\nEdmon... | Mary Echevarria | 50 East Kemper Way\nDallas, TX |
| 6 | 21480 | 0 | 18536 | 1149 | 19879 | 288 | 0 | 464-92-2043 | Piney Woods Bakery\n118 Fort Worth Hwy\nLufkin... | Victor J Ortiz | 246 East Almeria Road\nEl Paso, TX |
| 7 | 45716 | 2239 | 36910 | 2288 | 22800 | 330 | 0 | 214-40-8792 | Rise & Shine Carbondale\n248 2100 Rd\nCarbonda... | Victor G Martinez | 88 26th Street\nSilver Spring, MD |
| 8 | 15290 | 222 | 10994 | 681 | 17732 | 257 | 0 | 230-79-8572 | Voltage Vanguard\n5 Lantern Ln E\nSouth Boston... | Rhonda A Mcgrath | 85 Lovell St\nHampton, VA |
| 9 | 30137 | 1682 | 20543 | 1273 | 26436 | 383 | 0 | 335-54-9433 | Paws & Petals of Wadsworth\n455 32 Mile Rd\nWa... | Olga M Newman | 188 Holthaus Rd\nMorrison, IL |
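
Because Boxes 4 and 6 are deterministic functions of Boxes 3 and 5, the preview can be spot-checked by recomputing them. The sketch below copies three rows from the preview output (rows 0, 3 and 4) and reapplies the formulas from the expression columns; the function and constant names are illustrative and not part of the notebook's configuration.

```python
SS_RATE, SS_WAGE_BASE = 0.062, 147_000  # 2022 social security rate and wage base
MEDICARE_RATE = 0.0145                  # 2022 employee Medicare rate
ADDL_MEDICARE_RATE, ADDL_THRESHOLD = 0.009, 200_000  # Additional Medicare Tax


def expected_box_4(box_3):
    """Social security tax withheld, truncated to whole dollars."""
    return int(min(box_3, SS_WAGE_BASE) * SS_RATE)


def expected_box_6(box_5):
    """Medicare tax withheld, including the additional tax above the threshold."""
    return int(box_5 * MEDICARE_RATE + max(box_5 - ADDL_THRESHOLD, 0) * ADDL_MEDICARE_RATE)


# (box_3, box_4, box_5, box_6) taken from preview rows 0, 3 and 4.
rows = [
    (24086, 1493, 22731, 329),
    (52091, 3229, 109420, 1586),
    (136716, 8476, 134393, 1948),
]
mismatches = [
    row for row in rows
    if expected_box_4(row[0]) != row[1] or expected_box_6(row[2]) != row[3]
]
print(mismatches or "Boxes 4 and 6 match the formulas for all sampled rows")
```

Box 2 cannot be checked this way, since it includes a random noise factor and depends on the taxpayer's marital status, which is not among the final columns.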


## Generating and Saving the Final Dataset

Once we're happy with the preview, we can generate a larger dataset.

In [6]:
# Generate a final dataset
workflow_name = "synthetic-w2-dataset"

# Submit the job to generate 100 records
workflow_run = dd.create(
    num_records=100,
    name=workflow_name
)

workflow_run.wait_until_done()

[23:25:52] [INFO] 🚀 Submitting batch workflow
▶️ Creating Workflow: w_2vq6fXRCNHpt0e9Tz993QW3qOKy
▶️ Created Workflow Run: wr_2vq6fjEjq8GVDgh9yaNEdaESxea
🔗 Workflow Run console link: https://console-dev.gretel.ai/workflows/w_2vq6fXRCNHpt0e9Tz993QW3qOKy/runs/wr_2vq6fjEjq8GVDgh9yaNEdaESxea
Fetching task logs for workflow run wr_2vq6fjEjq8GVDgh9yaNEdaESxea
Workflow run is now in status: RUN_STATUS_CREATED
Got task wt_2vq6fiG3gsb6Rf2vR1yEX0EivlI
Workflow run is now in status: RUN_STATUS_ACTIVE
[using-samplers-to-generate-8-columns] Task Status is now: RUN_STATUS_ACTIVE
[using-samplers-to-generate-8-columns] 2025-04-17 03:30:07.671416+00:00 Preparing step 'using-samplers-to-generate-8-columns'
[using-samplers-to-generate-8-columns] 2025-04-17 03:30:20.119251+00:00 Starting 'generate_columns_using_samplers' task execution
[using-samplers-to-generate-8-columns] 2025-04-17 03:30:20.127127+00:00 🎲 👨‍🍳 Creating person generator
[using-samplers-to-generate-8-columns] 2025-04-17 03:30:39.281408+00

In [7]:
print(f"Generated dataset with {len(workflow_run.dataset.df)} records")

# Save the dataset to CSV
csv_filename = f"{workflow_name}.csv"
workflow_run.dataset.df.to_csv(csv_filename, index=False)
print(f"Dataset saved to {csv_filename}")

# Show a sample of the final dataset
workflow_run.dataset.df.head()
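
After saving, a lightweight check can confirm that the file has the expected columns and that SSNs follow the `NNN-NN-NNNN` pattern. The sketch below uses only the standard library and, so that it runs on its own, reads an in-memory sample built from two preview rows rather than the real file; `validate_rows` is a hypothetical helper, and the two checks are just examples of what one might verify.

```python
import csv
import io
import re

SSN_PATTERN = re.compile(r"^\d{3}-\d{2}-\d{4}$")


def validate_rows(handle, required_columns):
    """Return a list of problems found in a CSV file-like object."""
    reader = csv.DictReader(handle)
    missing = [c for c in required_columns if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing columns: {missing}"]
    problems = []
    for i, row in enumerate(reader):
        if not SSN_PATTERN.match(row["box_a_employee_ssn"]):
            problems.append(f"row {i}: malformed SSN {row['box_a_employee_ssn']!r}")
        if float(row["box_1_wages_tips_other_compensation"]) < 0:
            problems.append(f"row {i}: negative Box 1")
    return problems


# In-memory stand-in for the saved CSV, using two rows from the preview.
sample = io.StringIO(
    "box_1_wages_tips_other_compensation,box_a_employee_ssn\n"
    "21638,557-81-3959\n"
    "5555,030-86-2552\n"
)
columns = ["box_1_wages_tips_other_compensation", "box_a_employee_ssn"]
problems = validate_rows(sample, columns)
print(problems or "no problems found")
```

To run it against the saved file, pass `open(csv_filename, newline="")` in place of `sample`.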