<a target="_parent" href="https://colab.research.google.com/github/gretelai/gretel-blueprints/blob/main/docs/notebooks/data-designer/healthcare-datasets/clinical-trials.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# 🎨 Data Designer: Synthetic Clinical Trial Dataset Generator

This notebook creates a synthetic dataset of clinical trial records with realistic PII (Personally Identifiable Information) for testing data protection and anonymization techniques.

The dataset includes:
- Trial information and study design
- Participant demographics and health data (PII)
- Investigator and coordinator information (PII)
- Medical observations and notes with embedded PII
- Adverse event reports with varying severity

We'll use Data Designer to create this fully synthetic dataset from scratch.

In [None]:
%%capture
%pip install -U gretel_client

## Setting up Data Designer

First, we'll initialize the Gretel client and create a new Data Designer object. We'll use the `apache-2.0` model suite for this project.

In [None]:
from gretel_client.navigator_client import Gretel # type: ignore

# Initialize Gretel client - this will prompt for your API key
gretel = Gretel(api_key="prompt")

aidd = gretel.data_designer.new(model_suite="apache-2.0")

## Setting up Person Samplers

We'll create person samplers to generate consistent personal information for different roles in the clinical trial process:
- Participants (patients enrolled in the trial)
- Investigators (doctors conducting the trial)
- Study coordinators (staff managing the trial)
- Sponsors (pharmaceutical company representatives)

In [None]:
# Create person samplers for different roles, using the en_US locale
aidd.with_person_samplers({
    "participant": {"locale": "en_US"},
    "investigator": {"locale": "en_US"},
    "coordinator": {"locale": "en_US"},
    "sponsor": {"locale": "en_US"}
})

## Creating Trial Information

Next, we'll create the basic trial information:
- Study ID (unique identifier)
- Trial phase and therapeutic area
- Study design details
- Start and end dates for the trial

In [None]:
# Study identifiers
aidd.add_column(
    name="study_id",
    type="uuid",
    params={"prefix": "CT-", "short_form": True, "uppercase": True}
)

# Trial phase
aidd.add_column(
    name="trial_phase",
    type="category",
    params={
        "values": ["Phase I", "Phase II", "Phase III", "Phase IV"],
        "weights": [0.2, 0.3, 0.4, 0.1]
    }
)

# Therapeutic area
aidd.add_column(
    name="therapeutic_area",
    type="category",
    params={
        "values": ["Oncology", "Cardiology", "Neurology", "Immunology", "Infectious Disease"],
        "weights": [0.3, 0.2, 0.2, 0.15, 0.15]
    }
)

# Study design
aidd.add_column(
    name="study_design",
    type="subcategory",
    params={
        "category": "trial_phase",
        "values": {
            "Phase I": ["Single Arm", "Dose Escalation", "First-in-Human", "Safety Assessment"],
            "Phase II": ["Randomized", "Double-Blind", "Proof of Concept", "Open-Label Extension"],
            "Phase III": ["Randomized Controlled", "Double-Blind Placebo-Controlled", "Multi-Center", "Pivotal"],
            "Phase IV": ["Post-Marketing Surveillance", "Real-World Evidence", "Long-Term Safety", "Expanded Access"]
        }
    }
)

# Trial dates
aidd.add_column(
    name="trial_start_date",
    type="datetime",
    params={"start": "2022-01-01", "end": "2023-06-30"},
    convert_to="%Y-%m-%d"
)

aidd.add_column(
    name="trial_end_date",
    type="datetime",
    params={"start": "2023-07-01", "end": "2024-12-31"},
    convert_to="%Y-%m-%d"
)
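
To make the semantics of the `category` and `subcategory` columns concrete, here is a minimal plain-Python sketch of the same idea: draw a phase by weight, then draw a study design from the values allowed for that phase. It is an illustration only, not Data Designer's implementation. The helper name `sample_phase_and_design` is hypothetical, and uniform sampling within a subcategory is an assumption.

```python
import random

# Values and weights copied from the trial_phase column above.
PHASES = ["Phase I", "Phase II", "Phase III", "Phase IV"]
PHASE_WEIGHTS = [0.2, 0.3, 0.4, 0.1]

# Same mapping as the study_design column: parent category -> allowed values.
DESIGNS = {
    "Phase I": ["Single Arm", "Dose Escalation", "First-in-Human", "Safety Assessment"],
    "Phase II": ["Randomized", "Double-Blind", "Proof of Concept", "Open-Label Extension"],
    "Phase III": ["Randomized Controlled", "Double-Blind Placebo-Controlled", "Multi-Center", "Pivotal"],
    "Phase IV": ["Post-Marketing Surveillance", "Real-World Evidence", "Long-Term Safety", "Expanded Access"],
}


def sample_phase_and_design(rng):
    """Draw a phase by weight, then a design allowed for that phase."""
    phase = rng.choices(PHASES, weights=PHASE_WEIGHTS, k=1)[0]
    # Assumption: subcategory values are drawn uniformly within the parent.
    design = rng.choice(DESIGNS[phase])
    return phase, design


rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [sample_phase_and_design(rng) for _ in range(1000)]
for phase, design in samples[:5]:
    print(f"{phase}: {design}")
```

Because the design is always looked up under the sampled phase, a record can never pair, say, "Phase I" with "Post-Marketing Surveillance".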

## Participant Information

Now we'll create fields for participant demographics and enrollment details:
- Participant ID and name
- Birth date and email address (PII)
- Enrollment date and status
- Treatment arm assignment

In [None]:
# Participant identifiers and information
aidd.add_column(
    name="participant_id",
    type="uuid",
    params={"prefix": "PT-", "short_form": True, "uppercase": True}
)

aidd.add_column(
    name="participant_first_name",
    type="expression",
    expr="{{participant.first_name}}"
)

aidd.add_column(
    name="participant_last_name",
    type="expression",
    expr="{{participant.last_name}}"
)

aidd.add_column(
    name="participant_birth_date",
    type="expression",
    expr="{{participant.birth_date}}"
)

aidd.add_column(
    name="participant_email",
    type="expression",
    expr="{{participant.email_address}}"
)

# Enrollment information
aidd.add_column(
    name="enrollment_date",
    type="timedelta",
    params={
        "dt_min": 0,
        "dt_max": 60,
        "reference_column_name": "trial_start_date",
        "unit": "D"
    },
    convert_to="%Y-%m-%d"
)

aidd.add_column(
    name="participant_status",
    type="category",
    params={
        "values": ["Active", "Completed", "Withdrawn", "Lost to Follow-up"],
        "weights": [0.6, 0.2, 0.15, 0.05]
    }
)

aidd.add_column(
    name="treatment_arm",
    type="category",
    params={
        "values": ["Treatment", "Placebo", "Standard of Care"],
        "weights": [0.5, 0.3, 0.2]
    }
)
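
The `timedelta` column above offsets each record's `trial_start_date` by 0 to 60 days. A rough standard-library equivalent is sketched below. The function name `sample_enrollment_date` is hypothetical, and treating both bounds as inclusive is an assumption rather than documented Data Designer behavior.

```python
import random
from datetime import date, timedelta


def sample_enrollment_date(trial_start, rng, dt_min=0, dt_max=60):
    """Return trial_start plus a random whole number of days."""
    # Assumption: dt_min and dt_max are both inclusive.
    offset_days = rng.randint(dt_min, dt_max)
    return trial_start + timedelta(days=offset_days)


rng = random.Random(7)
start = date(2023, 6, 30)  # latest trial_start_date allowed by the config above
enrolled = sample_enrollment_date(start, rng)
latest_possible = start + timedelta(days=60)

print("sampled enrollment:", enrolled.strftime("%Y-%m-%d"))
print("latest possible:   ", latest_possible.strftime("%Y-%m-%d"))
```

Under these assumptions the latest possible enrollment date is 2023-08-29, which is later than the earliest allowed `trial_end_date` (2023-07-01). An unconstrained draw can therefore place enrollment after the trial has ended.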

## Investigator and Staff Information

Here we'll add information about the trial staff:
- Investigator information (principal investigator)
- Study coordinator details
- Site information
- Per-patient cost and participant compensation

In [None]:
# Investigator information
aidd.add_column(
    name="investigator_first_name",
    type="expression",
    expr="{{investigator.first_name}}"
)

aidd.add_column(
    name="investigator_last_name",
    type="expression",
    expr="{{investigator.last_name}}"
)

aidd.add_column(
    name="investigator_id",
    type="uuid",
    params={"prefix": "INV-", "short_form": True, "uppercase": True}
)

# Study coordinator information
aidd.add_column(
    name="coordinator_first_name",
    type="expression",
    expr="{{coordinator.first_name}}"
)

aidd.add_column(
    name="coordinator_last_name",
    type="expression",
    expr="{{coordinator.last_name}}"
)

aidd.add_column(
    name="coordinator_email",
    type="expression",
    expr="{{coordinator.email_address}}"
)

# Site information
aidd.add_column(
    name="site_id",
    type="category",
    params={
        "values": ["SITE-001", "SITE-002", "SITE-003", "SITE-004", "SITE-005"]
    }
)

aidd.add_column(
    name="site_location",
    type="category",
    params={
        "values": ["London", "Manchester", "Birmingham", "Edinburgh", "Cambridge"]
    }
)

# Study costs
aidd.add_column(
    name="per_patient_cost",
    type="gaussian",
    params={"mean": 15000, "stddev": 5000, "min": 5000}
)

aidd.add_column(
    name="participant_compensation",
    type="gaussian", 
    params={"mean": 500, "stddev": 200, "min": 100}
)

## Clinical Measurements and Outcomes

These columns will track the key clinical data collected during the trial:
- Baseline and final measurements, plus the percent change between them
- Dosing information (level and frequency)
- Protocol compliance rate

In [None]:
# Basic clinical measurements
aidd.add_column(
    name="baseline_measurement",
    type="gaussian",
    params={"mean": 100, "stddev": 15},
    convert_to="float"
)

aidd.add_column(
    name="final_measurement",
    type="gaussian",
    params={"mean": 85, "stddev": 20},
    convert_to="float"
)

# Calculate percent change
aidd.add_column(
    name="percent_change",
    type="expression",
    expr="{{(final_measurement - baseline_measurement) / baseline_measurement * 100}}"
)

# Dosing information
aidd.add_column(
    name="dose_level",
    type="category",
    params={
        "values": ["Low", "Medium", "High", "Placebo"],
        "weights": [0.3, 0.3, 0.2, 0.2]
    }
)

aidd.add_column(
    name="dose_frequency",
    type="category",
    params={
        "values": ["Once daily", "Twice daily", "Weekly", "Biweekly"],
        "weights": [0.4, 0.3, 0.2, 0.1]
    }
)

# Protocol compliance
aidd.add_column(
    name="compliance_rate",
    type="uniform",
    params={"low": 0.7, "high": 1.0}
)
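
The `percent_change` expression above is ordinary arithmetic on the two measurement columns. The sketch below restates it as a small Python function for reference. The name `percent_change` mirrors the column, while the zero-baseline guard is an addition for illustration and is not part of the expression.

```python
def percent_change(baseline, final):
    """Relative change from baseline to final, expressed in percent."""
    if baseline == 0:
        # Guard added for this sketch; the column expression has no such check.
        raise ValueError("baseline must be non-zero")
    return (final - baseline) / baseline * 100


# A participant sitting exactly at the two configured means (100 and 85).
change_at_means = percent_change(100.0, 85.0)
print(f"{change_at_means:.1f}%")
```

A negative value means the final measurement is lower than the baseline. Since the two measurements are sampled independently, individual records can show increases as well as decreases.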

## Adverse Events Tracking

Here we'll capture adverse events that occur during the clinical trial:
- Adverse event presence and type
- Severity and relatedness to treatment
- Resolution status

In [None]:
# Adverse event flags and details
aidd.add_column(
    name="has_adverse_event",
    type="bernoulli",
    params={"p": 0.3}
)

aidd.add_column(
    name="adverse_event_type",
    type="category",
    params={
        "values": ["Headache", "Nausea", "Fatigue", "Rash", "Dizziness", "Pain at injection site", "Other"],
        "weights": [0.2, 0.15, 0.15, 0.1, 0.1, 0.2, 0.1]
    },
    conditional_params={"has_adverse_event == 0": {"values": ["None"]}}
)

aidd.add_column(
    name="adverse_event_severity",
    type="category",
    params={"values": ["Mild", "Moderate", "Severe", "Life-threatening"]},
    conditional_params={"has_adverse_event == 0": {"values": ["NA"]}}
)

aidd.add_column(
    name="adverse_event_relatedness",
    type="category",
    params={
        "values": ["Unrelated", "Possibly related", "Probably related", "Definitely related"],
        "weights": [0.2, 0.4, 0.3, 0.1]
    },
    conditional_params={"has_adverse_event == 0": {"values": ["NA"]}}
)

aidd.add_column(
    name="adverse_event_resolved",
    type="category",
    params={"values": ["NA"]},
    conditional_params={"has_adverse_event == 1": {"values": ["Yes", "No"], "weights": [0.8, 0.2]}}
)
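
The intent of `conditional_params` here is that the detail columns only carry real values when `has_adverse_event` is 1. The plain-Python sketch below mimics that branching for two of the columns so the expected shape of the data is easy to see. It is an illustration, not the library's sampling code, and `sample_adverse_event` is a hypothetical helper.

```python
import random

# Values and weights copied from the adverse_event_type column above.
AE_TYPES = ["Headache", "Nausea", "Fatigue", "Rash", "Dizziness",
            "Pain at injection site", "Other"]
AE_WEIGHTS = [0.2, 0.15, 0.15, 0.1, 0.1, 0.2, 0.1]


def sample_adverse_event(rng, p=0.3):
    """Sample the event flag first, then fill dependent fields accordingly."""
    has_event = 1 if rng.random() < p else 0
    if has_event == 0:
        # No event: dependent columns collapse to their placeholder values.
        return {"has_adverse_event": 0,
                "adverse_event_type": "None",
                "adverse_event_resolved": "NA"}
    return {
        "has_adverse_event": 1,
        "adverse_event_type": rng.choices(AE_TYPES, weights=AE_WEIGHTS, k=1)[0],
        "adverse_event_resolved": rng.choices(["Yes", "No"], weights=[0.8, 0.2], k=1)[0],
    }


rng = random.Random(42)
rows = [sample_adverse_event(rng) for _ in range(1000)]
event_share = sum(row["has_adverse_event"] for row in rows) / len(rows)
print(f"share of rows with an adverse event: {event_share:.2f}")
```

With `p` set to 0.3, roughly 30% of records should describe an event and the rest should hold only the placeholder values.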

## Narrative text fields with style variations

These fields will contain natural language text that incorporates PII elements.
We'll use style seed categories to ensure diversity in the writing styles:

1. Medical observations and notes
2. Adverse event descriptions  
3. Protocol deviation explanations

In [None]:
# Documentation style category
aidd.add_column(
    name="documentation_style",
    type="category",
    params={
        "values": ["Formal and Technical", "Concise and Direct", "Detailed and Descriptive"],
        "weights": [0.4, 0.3, 0.3]
    }
)

# Medical observations - varies based on documentation style
aidd.add_column(
    name="medical_observations",
    type="llm-text",
    prompt="""
    {% if documentation_style == "Formal and Technical" %}
    Write formal and technical medical observations for participant {{ participant_first_name }} {{ participant_last_name }} 
    (ID: {{ participant_id }}) in the clinical trial for {{ therapeutic_area }} (Study ID: {{ study_id }}).
    
    Include observations related to their enrollment in the {{ dose_level }} dose group with {{ dose_frequency }} administration.
    Baseline measurement was {{ baseline_measurement }} and final measurement was {{ final_measurement }}, representing a 
    change of {{ percent_change }}%.
    
    Use proper medical terminology, maintain a highly formal tone, and structure the notes in a technical format with appropriate 
    sections and subsections. Include at least one reference to the site investigator, Dr. {{ investigator_last_name }}.
    {% elif documentation_style == "Concise and Direct" %}
    Write brief, direct medical observations for patient {{ participant_first_name }} {{ participant_last_name }} 
    ({{ participant_id }}) in {{ therapeutic_area }} trial {{ study_id }}.
    
    Note: {{ dose_level }} dose, {{ dose_frequency }}. Baseline: {{ baseline_measurement }}. Final: {{ final_measurement }}. 
    Change: {{ percent_change }}%.
    
    Keep notes extremely concise, using abbreviations where appropriate. Mention follow-up needs and reference 
    Dr. {{ investigator_last_name }} briefly.
    {% else %}
    Write detailed and descriptive medical observations for participant {{ participant_first_name }} {{ participant_last_name }}
    enrolled in the {{ therapeutic_area }} clinical trial ({{ study_id }}).
    
    Provide a narrative description of their experience in the {{ dose_level }} dose group with {{ dose_frequency }} dosing.
    Describe how their measurements changed from baseline ({{ baseline_measurement }}) to final ({{ final_measurement }}),
    representing a {{ percent_change }}% change.
    
    Use a mix of technical terms and explanatory language. Include thorough descriptions of observed effects and subjective
    patient reports. Mention interactions with the investigator, Dr. {{ investigator_first_name }} {{ investigator_last_name }}.
    {% endif %}
    """
)

# Adverse event descriptions - conditional on having an adverse event
aidd.add_column(
    name="adverse_event_description",
    type="llm-text",
    prompt="""
    {% if has_adverse_event == 1 %}
    [INSTRUCTIONS: Write a brief clinical description (1-2 sentences only) of the adverse event. Use formal medical language. Do not include meta-commentary or explain what you're doing.]\
    {{adverse_event_type}}, {{adverse_event_severity}}. {{adverse_event_relatedness}} to study treatment. 
    {% if adverse_event_resolved == "Yes" %}Resolved.{% else %}Ongoing.{% endif %}
    {% else %}
    [INSTRUCTIONS: Output only the exact text "No adverse events reported" without any additional commentary.]\
    No adverse events reported.\
    {% endif %}
    """
)

# Protocol deviation description (if compliance is low)
aidd.add_column(
    name="protocol_deviation",
    type="llm-text",
    prompt="""
    {% if compliance_rate < 0.85 %}
    {% if documentation_style == "Formal and Technical" %}
    [FORMAT INSTRUCTIONS: Write in a direct documentation style. Do not use phrases like "it looks like" or "you've provided". Begin with the protocol deviation details. Use formal terminology.]
    
    PROTOCOL DEVIATION REPORT
    Study ID: {{ study_id }}
    Participant: {{ participant_first_name }} {{ participant_last_name }} ({{ participant_id }})
    Compliance Rate: {{ compliance_rate }}
    
    [Continue with formal description of the deviation, impact on data integrity, and corrective actions. Reference coordinator {{ coordinator_first_name }} {{ coordinator_last_name }} and Dr. {{ investigator_last_name }}]
    {% elif documentation_style == "Concise and Direct" %}
    [FORMAT INSTRUCTIONS: Use only brief notes and bullet points. No introductions or explanations.]
    
    PROTOCOL DEVIATION - {{ participant_id }}
    • Compliance: {{ compliance_rate }}
    • Impact: [severity level]
    • Actions: [list actions]
    • Coordinator: {{ coordinator_first_name }} {{ coordinator_last_name }}
    • PI: Dr. {{ investigator_last_name }}
    {% else %}
    [FORMAT INSTRUCTIONS: Write a narrative description. Begin directly with the deviation details. No meta-commentary.]
    
    During the {{ therapeutic_area }} study at {{ site_location }}, participant {{ participant_first_name }} {{ participant_last_name }} demonstrated a compliance rate of {{ compliance_rate }}, which constitutes a protocol deviation.
    
    [Continue with narrative about circumstances, discovery, impact, and team response. Include references to {{ coordinator_first_name }} {{ coordinator_last_name }} and Dr. {{ investigator_first_name }} {{ investigator_last_name }}]
    {% endif %}
    {% else %}
    [FORMAT INSTRUCTIONS: Write a simple direct statement. No meta-commentary or explanation.]
    
    PROTOCOL COMPLIANCE ASSESSMENT
    Participant: {{ participant_first_name }} {{ participant_last_name }} ({{ participant_id }})
    Finding: No protocol deviations. Compliance rate: {{ compliance_rate }}.
    {% endif %}
    """
)
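
The `protocol_deviation` prompt branches on `compliance_rate < 0.85`, and `compliance_rate` is drawn uniformly between 0.7 and 1.0. The short calculation below estimates how often the deviation branch fires. The helper name `fraction_below` is hypothetical, and the result assumes the sampled values really are uniform over that range.

```python
def fraction_below(threshold, low=0.7, high=1.0):
    """Probability that a Uniform(low, high) draw falls below threshold."""
    if high <= low:
        raise ValueError("high must be greater than low")
    # Clamp so thresholds outside the range give 0 or 1.
    clipped = min(max(threshold, low), high)
    return (clipped - low) / (high - low)


deviation_share = fraction_below(0.85)
print(f"expected share of deviation reports: {deviation_share:.0%}")
```

Under that assumption about half of the records receive a deviation report and the other half receive the short compliance statement. Raising or lowering the 0.85 threshold in the template shifts that balance.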

## Adding Constraints

Finally, we'll add constraints to ensure our data is logically consistent:
- Trial end dates must come after trial start dates
- Enrollment dates must fall on or after the trial start and before the trial end
- Baseline and final measurements must be positive

In [None]:
# Ensure appropriate date sequence
aidd.add_constraint(
    target_column="trial_end_date",
    type="column_inequality",
    params={"operator": "gt", "rhs": "trial_start_date"}
)

aidd.add_constraint(
    target_column="enrollment_date",
    type="column_inequality",
    params={"operator": "ge", "rhs": "trial_start_date"}
)

aidd.add_constraint(
    target_column="enrollment_date",
    type="column_inequality",
    params={"operator": "lt", "rhs": "trial_end_date"}
)

# Ensure reasonable clinical measurements
aidd.add_constraint(
    target_column="baseline_measurement",
    type="scalar_inequality",
    params={"operator": "gt", "rhs": 0}
)

aidd.add_constraint(
    target_column="final_measurement",
    type="scalar_inequality",
    params={"operator": "gt", "rhs": 0}
)
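
After generation, it can be useful to re-check the same rules independently of the generator. The sketch below does this for a single record represented as a dictionary. The function `find_violations` and the two example records are hypothetical, written only to show the checks that correspond to the constraints above.

```python
from datetime import date


def find_violations(record):
    """Return a list of constraint violations for one generated record."""
    start = date.fromisoformat(record["trial_start_date"])
    end = date.fromisoformat(record["trial_end_date"])
    enrolled = date.fromisoformat(record["enrollment_date"])

    violations = []
    if not end > start:
        violations.append("trial_end_date must be after trial_start_date")
    if not enrolled >= start:
        violations.append("enrollment_date must be on or after trial_start_date")
    if not enrolled < end:
        violations.append("enrollment_date must be before trial_end_date")
    for column in ("baseline_measurement", "final_measurement"):
        if not float(record[column]) > 0:
            violations.append(f"{column} must be greater than 0")
    return violations


# Hypothetical records, made up for this sketch.
good = {
    "trial_start_date": "2023-06-30",
    "trial_end_date": "2023-07-15",
    "enrollment_date": "2023-07-10",
    "baseline_measurement": 101.2,
    "final_measurement": 88.4,
}
bad = dict(good, enrollment_date="2023-08-20", final_measurement=-3.0)

print(find_violations(good))
print(find_violations(bad))
```

Running a check like this over every row of the generated dataset gives a quick confirmation that the constraints were applied as intended.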


## Preview and Generate Dataset

First, we'll preview a small sample to verify our configuration is working correctly.
Then we'll generate the full dataset with the desired number of records.

In [None]:
# Preview a few records
preview = aidd.preview()

In [None]:
# Display one sample record from the preview
preview.display_sample_record()

In [None]:
# Define a common name for the workflow and file
workflow_name = "clinical-trial-data"

# Submit batch job
workflow_run = aidd.create(
    num_records=100,
    name=workflow_name
)

workflow_run.wait_until_done()
print("\nGenerated dataset shape:", workflow_run.dataset.df.shape)

In [None]:
# Display the first few rows of the generated dataset
workflow_run.dataset.df.head()

In [None]:
# Save the dataset
csv_filename = f"{workflow_name}.csv" 
workflow_run.dataset.df.to_csv(csv_filename, index=False)
print(f"Dataset with {len(workflow_run.dataset.df)} records saved to {csv_filename}")
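
Once the CSV is saved, one simple sanity check is to compare observed category shares with the configured weights. The sketch below uses only the standard library. The helper `category_shares` is hypothetical, and the inline CSV is a tiny made-up stand-in for the real file.

```python
import csv
import io
from collections import Counter


def category_shares(rows, column):
    """Return the share of rows holding each distinct value of a column."""
    counts = Counter(row[column] for row in rows)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


# Tiny hypothetical stand-in for the saved dataset.
sample_csv = io.StringIO(
    "participant_status,treatment_arm\n"
    "Active,Treatment\n"
    "Active,Placebo\n"
    "Completed,Treatment\n"
    "Withdrawn,Standard of Care\n"
)
shares = category_shares(csv.DictReader(sample_csv), "participant_status")
print(shares)
```

To run the same check on the real output, pass `csv.DictReader(open(csv_filename, newline=""))` instead of the inline sample. With only 100 records, expect the observed shares to drift a few percentage points from the configured weights.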