### Simple Doctor's Notes
This example notebook walks you through using AI Data Designer to create a simple synthetic dataset of doctor's notes. The walkthrough shows how to build up a dataset step by step, starting with simple statistical samplers and layering on more complex AI-generated columns to synthesize a rich, complete dataset.

### Initializing the Client

The first step is to initialize the AI Data Designer (AIDD) client and log in with your Gretel credentials. Passing `api_key="prompt"` asks for the key interactively, as the cell output shows.

In [2]:
from gretel_client.navigator_client import Gretel

gretel = Gretel(api_key="prompt")

Gretel API Key: ··········
Logged in as travis@gretel.ai ✅


INFO:gretel_client.navigator_client:Gretel client configured to use project: proj_2u8UDpxu7JxxZwr0re7EOSGhHPk


### Blood Pressure Columns
Next, we'll start building the synthetic dataset with simple sampling techniques that create basic health metrics for each patient. We'll create three columns: `patient_id`, `bp_systolic`, and `bp_diastolic`. Then we'll generate a preview and look at the results.

In [21]:
# Instantiate an AI Data Designer object using the apache-2.0 model suite
aidd = gretel.data_designer.new(model_suite="apache-2.0")

aidd.add_column(
    name="patient_id",
    type="uuid",
    params={"prefix": "PT-", "short_form": True, "uppercase": True}
).add_column(
    name="bp_systolic",
    type="gaussian",
    params={"mean": 145.0, "stddev": 30.0, "convert_to_int": True}
).add_column(
    name="bp_diastolic",
    type="gaussian",
    params={"mean": 80.0, "stddev": 10.0}
)
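
To build intuition for what these three samplers produce, here is a minimal local sketch that uses only Python's standard library. It is not how AIDD implements its samplers: the `sample_patient` helper is hypothetical, `short_form` is assumed to mean the first eight hex characters of the UUID (which matches the IDs in the preview output later in this notebook), and the `convert_to_int` flag is ignored.

```python
import random
import uuid


def sample_patient(rng: random.Random) -> dict:
    """Draw one row that mimics the three sampler columns (illustrative only)."""
    # uuid column: "PT-" prefix plus an assumed 8-character short form, uppercased
    raw = uuid.UUID(int=rng.getrandbits(128), version=4).hex
    patient_id = "PT-" + raw[:8].upper()
    # gaussian columns: same means and standard deviations as the AIDD config
    bp_systolic = rng.gauss(145.0, 30.0)
    bp_diastolic = rng.gauss(80.0, 10.0)
    return {
        "patient_id": patient_id,
        "bp_systolic": bp_systolic,
        "bp_diastolic": bp_diastolic,
    }


rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows = [sample_patient(rng) for _ in range(10)]
for row in rows[:3]:
    print(row)
```

Each column here is drawn on its own, without looking at the others, which is the main thing to keep in mind when layering more realistic columns on top.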

### Generating a Preview
Now that you've defined the data you want, you can create a preview by calling `aidd.preview()`. If you capture that preview in a variable, you can look at a sample record by calling `preview.display_sample_record()`, or access the underlying pandas DataFrame through `preview.dataset.df`.

In [22]:
preview = aidd.preview()
preview.display_sample_record()

[19:26:03] [INFO] 🚀 Generating preview
[19:26:03] [INFO] ⚙️ Configuring Data Designer Workflow steps:
[19:26:03] [INFO]   |-- Step 1: generate_columns_using_samplers-1
[19:26:04] [INFO] 🦜 Step 1: Generate columns using samplers
[19:26:04] [INFO]   |-- 🎲 Using numerical samplers to generate 10 records across 3 columns
[19:26:04] [INFO] 👀 Your dataset preview is ready for a peek!


In [23]:
preview.dataset.df.head()

| | patient_id | bp_systolic | bp_diastolic |
|---|---|---|---|
| 0 | PT-CA1D2570 | 165.096881 | 83.335961 |
| 1 | PT-994FF47F | 129.250579 | 78.54258 |
| 2 | PT-98AE9476 | 159.469497 | 72.500147 |
| 3 | PT-F4319CE3 | 138.080187 | 82.240023 |
| 4 | PT-7193752B | 137.653491 | 87.833587 |
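
Before layering LLM columns on top, it can help to sanity-check the sampled values. The sketch below copies the five rows shown above into a string and checks them with the standard library. If the two Gaussian columns are sampled independently, as the configuration suggests, a small share of rows could end up with a systolic value at or below the diastolic one, so that is the check performed here. The variable names are illustrative; with a live session you would run the same checks on `preview.dataset.df` instead.

```python
import csv
import io
import statistics

# The five preview rows shown above, copied verbatim
preview_csv = """patient_id,bp_systolic,bp_diastolic
PT-CA1D2570,165.096881,83.335961
PT-994FF47F,129.250579,78.54258
PT-98AE9476,159.469497,72.500147
PT-F4319CE3,138.080187,82.240023
PT-7193752B,137.653491,87.833587
"""

records = list(csv.DictReader(io.StringIO(preview_csv)))
systolic = [float(r["bp_systolic"]) for r in records]
diastolic = [float(r["bp_diastolic"]) for r in records]

# Flag rows where systolic is not above diastolic, which is physiologically implausible
implausible = [
    r["patient_id"]
    for r, s, d in zip(records, systolic, diastolic)
    if s <= d
]

mean_systolic = statistics.mean(systolic)
mean_diastolic = statistics.mean(diastolic)
print(f"mean systolic:  {mean_systolic:.2f}")
print(f"mean diastolic: {mean_diastolic:.2f}")
print(f"implausible rows: {implausible}")
```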


### Generating Free Text using LLMs
With the previous few lines of code, we've created a dataset of example patients, each with a blood pressure reading. We can now add a `doctors_note` column generated by an LLM by passing the metrics we've sampled into a custom prompt. The prompt uses Jinja templating syntax to reference the columns `{{ patient_id }}`, `{{ bp_systolic }}`, and `{{ bp_diastolic }}`.


In [24]:
aidd.add_column(
    name="doctors_note",
    prompt="""
    Create a doctors recommendation based on the patients blood pressure, where blood pressure
    for patient={{ patient_id }} is equal to {{ bp_systolic }}/{{ bp_diastolic }}
    """
)
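
To make the templating step concrete, the sketch below fills the same placeholders for a single row. It is a simplified stand-in that only handles plain `{{ name }}` placeholders with a regular expression. It is not Jinja itself and not AIDD's rendering code, and the `render` helper is hypothetical. The row values are taken from the first preview record shown earlier.

```python
import re

# Matches simple placeholders such as {{ patient_id }}
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, row: dict) -> str:
    """Replace each {{ name }} placeholder with the matching value from row."""

    def substitute(match: re.Match) -> str:
        return str(row[match.group(1)])

    return PLACEHOLDER.sub(substitute, template)


template = (
    "Create a doctors recommendation based on the patients blood pressure, "
    "where blood pressure for patient={{ patient_id }} "
    "is equal to {{ bp_systolic }}/{{ bp_diastolic }}"
)
row = {
    "patient_id": "PT-CA1D2570",
    "bp_systolic": 165.096881,
    "bp_diastolic": 83.335961,
}

prompt = render(template, row)
print(prompt)
```

One rendered prompt like this is produced per row, which is consistent with the generation log reporting one inference request per record.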

In [25]:
aidd.preview().display_sample_record()

[19:26:13] [INFO] 🚀 Generating preview
[19:26:14] [INFO] ⚙️ Configuring Data Designer Workflow steps:
[19:26:14] [INFO]   |-- Step 1: generate_columns_using_samplers-1
[19:26:14] [INFO]   |-- Step 2: generate_column_from_template-2
[19:26:15] [INFO] 🦜 Step 1: Generate columns using samplers
[19:26:15] [INFO]   |-- 🎲 Using numerical samplers to generate 10 records across 3 columns
[19:26:15] [INFO] 🦜 Step 2: Generate column from template
[19:26:15] [INFO]   |-- 📝 Preparing template to generate data column `doctors_note`
[19:26:15] [INFO]   |   |-- model_alias: ModelAlias.NATURAL_LANGUAGE
[19:26:37] [INFO]   |-- Generation summary for field: doctors_note
[19:26:37] [INFO]   |-- 	Total inference requests: 10
[19:26:37] [INFO]   |-- 	Successful requests: 10
[19:26:37] [INFO]   |-- Model usage: [{"model": "gretel/stelterlab/Mistral-Small-24B-Instruct-2501-AWQ", "prompt_tokens": 788, "completion_tokens": 5249, "request_count": 10, "total_tokens": 6037}]
[19:26:38] [INFO] 👀 Your dataset preview is ready for a peek!

### Expanding on the example
This simple example shows how to build up datasets by mixing sampling techniques with AI-generated columns. We get thoughtful, realistic doctor's notes for a range of patient statistics with only a few lines of code, and we can scale this out as needed.

There is still room for improvement, though. The notes refer to patients only by their ID rather than by name. We can use a Person Sampler to fill in patient names, and then update the generation prompt to reference the new columns.

In [28]:
aidd.with_person_samplers({"patient_sampler": {"locale": "en_GB"}})
aidd.add_column(
    name="first_name",
    type="expression",
    params={"expr": "patient_sampler.first_name"}
)
aidd.add_column(
    name="last_name",
    type="expression",
    params={"expr": "patient_sampler.last_name"}
)
aidd.add_column(
    name="doctors_note",
    prompt="""
    Create a doctors recommendation based on the patients blood pressure
    for a patient with ID {{ patient_id }}, name {{ first_name }} {{ last_name }},
    and blood pressure of {{ bp_systolic }}/{{ bp_diastolic }}
    """
)
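
The `expression` columns above pull individual fields out of the sampled person with a dotted path such as `patient_sampler.first_name`. The sketch below shows that kind of lookup in plain Python. It is only an illustration of the idea, not AIDD's implementation: the `Person` class, the `resolve` helper, and the example name are all made up for this sketch.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass
class Person:
    """Hypothetical stand-in for one record produced by a person sampler."""

    first_name: str
    last_name: str


def resolve(expr: str, context: dict):
    """Resolve a dotted path like 'patient_sampler.first_name' against context."""
    head, *attrs = expr.split(".")
    # Start from the named object, then follow each attribute in turn
    return reduce(getattr, attrs, context[head])


# Made-up example person; a real sampler would generate one per row
context = {"patient_sampler": Person(first_name="Amelia", last_name="Clarke")}

first_name = resolve("patient_sampler.first_name", context)
last_name = resolve("patient_sampler.last_name", context)
full_name = f"{first_name} {last_name}"
print(full_name)
```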

In [30]:
aidd.preview().display_sample_record()

[19:28:58] [INFO] 🚀 Generating preview
[19:28:59] [INFO] ⚙️ Configuring Data Designer Workflow steps:
[19:28:59] [INFO]   |-- Step 1: generate_columns_using_samplers-1
[19:28:59] [INFO]   |-- Step 2: generate_column_from_template-2
[19:28:59] [INFO]   |-- Step 3: drop_columns-3
[19:28:59] [INFO] 🦜 Step 1: Generate columns using samplers
[19:28:59] [INFO]   |-- 🎲 Using numerical samplers to generate 10 records across 6 columns
[19:29:00] [INFO] 🦜 Step 2: Generate column from template
[19:29:00] [INFO]   |-- 📝 Preparing template to generate data column `doctors_note`
[19:29:00] [INFO]   |   |-- model_alias: ModelAlias.NATURAL_LANGUAGE
[19:29:25] [INFO]   |-- Generation summary for field: doctors_note
[19:29:25] [INFO]   |-- 	Total inference requests: 10
[19:29:25] [INFO]   |-- 	Successful requests: 10
[19:29:25] [INFO]   |-- Model usage: [{"model": "gretel/stelterlab/Mistral-Small-24B-Instruct-2501-AWQ", "prompt_tokens": 844, "completion_tokens": 5861, "request_count": 10, "total_tokens"

### Full Example

In [20]:
# Instantiate an AI Data Designer object using the apache-2.0 model suite
aidd = gretel.data_designer.new(model_suite="apache-2.0")
aidd.with_person_samplers({"patient_sampler": {"locale": "en_GB"}})

aidd.add_column(
    name="patient_id",
    type="uuid",
    params={"prefix": "PT-", "short_form": True, "uppercase": True}
).add_column(
    name="bp_systolic",
    type="gaussian",
    params={"mean": 145.0, "stddev": 30.0, "convert_to_int": True}
).add_column(
    name="bp_diastolic",
    type="gaussian",
    params={"mean": 80.0, "stddev": 10.0}
).add_column(
    name="first_name",
    type="expression",
    params={"expr": "patient_sampler.first_name"}
).add_column(
    name="last_name",
    type="expression",
    params={"expr": "patient_sampler.last_name"}
).add_column(
    name="doctors_note",
    prompt="""
    Create a doctors recommendation based on the patients blood pressure
    for
    - Patient ID is {{ patient_id }}.
    - Patient Name is {{ first_name }} {{ last_name }}
    - Blood Pressure is {{ bp_systolic }}/{{ bp_diastolic }}
    """
)
preview = aidd.preview()
preview.display_sample_record()

[19:25:22] [INFO] 🚀 Generating preview
[19:25:22] [INFO] ⚙️ Configuring Data Designer Workflow steps:
[19:25:22] [INFO]   |-- Step 1: generate_columns_using_samplers-1
[19:25:22] [INFO]   |-- Step 2: generate_column_from_template-2
[19:25:22] [INFO]   |-- Step 3: drop_columns-3
[19:25:23] [INFO] 🦜 Step 1: Generate columns using samplers
[19:25:23] [INFO]   |-- 🎲 Using numerical samplers to generate 10 records across 6 columns
[19:25:23] [INFO] 🦜 Step 2: Generate column from template
[19:25:24] [INFO]   |-- 📝 Preparing template to generate data column `doctors_note`
[19:25:24] [INFO]   |   |-- model_alias: ModelAlias.NATURAL_LANGUAGE
[19:25:47] [INFO]   |-- Generation summary for field: doctors_note
[19:25:47] [INFO]   |-- 	Total inference requests: 10
[19:25:47] [INFO]   |-- 	Successful requests: 10
[19:25:47] [INFO]   |-- Model usage: [{"model": "gretel/stelterlab/Mistral-Small-24B-Instruct-2501-AWQ", "prompt_tokens": 927, "completion_tokens": 5500, "request_count": 10, "total_tokens"

### Conclusion
This example shows how easy it is to iterate on synthetic datasets using Gretel's AI Data Designer. The same approach extends naturally to a more complete set of patient metrics for the notes, such as BMI, height, age, pulse, and temperature. As an exercise, try adding those columns and updating the prompt to get richer, more complete synthetic notes. Take a look at the other sampling column types listed below, and customize the dataset to your preferences!

### 🎲 Sampling Column Types

These are the non-LLM data sources currently available in AIDD.

| Type | Parameters | Notes |
|------|-----------|-------:|
| expression | `expr: str` | This is powered by Jinja. |
| category | `values: list[str \| int \| float]`<br>`weights: Optional[list[float]]` | |
| subcategory | `category: str`<br>`values: dict[str, list[str \| int \| float]]` | `category` must refer to an existing category column.|
| datetime | `start: str`<br>`end: str`<br>`unit: Literal["Y", "M", "D", "h", "m", "s"] = "D"` | |
| timedelta | `dt_min: int` (>= 0)<br>`dt_max: int` (> 0)<br>`reference_column_name: str`<br>`unit: Literal["D", "h", "m", "s"] = "D"` | |
| uuid | `prefix: Optional[str]`<br>`short_form: bool = False`<br>`uppercase: bool = False` | |
| scipy | `dist_name: str`<br>`dist_params: dict` | This exposes all distributions that are available in [scipy.stats](https://docs.scipy.org/doc/scipy/reference/stats.html). |
| binomial | `n: int`<br>`p: float` | |
| bernoulli | `p: float` | |
| gaussian | `mean: float`<br>`stddev: float` | |
| poisson | `mean: float` | |
| uniform | `low: float`<br>`high: float` | |
| person | `locale: str = "en_US"`<br>`sex: SexT \| list[SexT] \| None = None`<br>`city: str \| list[str] \| None = None`<br>*(where `SexT = Literal["Male", "Female"]`)* | When `locale = "en_US"`, this is powered by our PGM. <br> Otherwise, it uses `Faker` (quality not guaranteed in this case).  |

> **Note:** The error messages related to the configuration of these sources are something we are actively improving.
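
As a starting point for the exercise in the conclusion, the sketch below describes a few extra patient metrics as plain dictionaries, using only sampler types and parameter names from the table above. The column names, means, ranges, and probabilities are illustrative assumptions, not clinical reference values, and the `missing_params` check is a hypothetical local helper rather than part of AIDD. The block runs without a Gretel session. With a live `aidd` object, each spec could be passed to `aidd.add_column(**spec)`, following the `name`, `type`, `params` pattern used throughout this notebook.

```python
import re

# Illustrative column specs; every value here is an assumption for the sketch
extra_columns = [
    {"name": "age", "type": "uniform", "params": {"low": 18.0, "high": 90.0}},
    {"name": "height_cm", "type": "gaussian", "params": {"mean": 170.0, "stddev": 10.0}},
    {"name": "pulse_bpm", "type": "poisson", "params": {"mean": 72.0}},
    {"name": "temperature_c", "type": "gaussian", "params": {"mean": 36.8, "stddev": 0.4}},
    {"name": "smoker", "type": "bernoulli", "params": {"p": 0.15}},
]

# Required parameter names per sampler type, copied from the table above
REQUIRED = {
    "uniform": {"low", "high"},
    "gaussian": {"mean", "stddev"},
    "poisson": {"mean"},
    "bernoulli": {"p"},
}


def missing_params(spec: dict) -> set:
    """Return required parameter names that the spec does not provide."""
    return REQUIRED[spec["type"]] - set(spec["params"])


problems = {s["name"]: missing_params(s) for s in extra_columns if missing_params(s)}

# An updated prompt that references the new columns alongside the existing ones
note_prompt = """
Create a doctors recommendation for patient {{ first_name }} {{ last_name }}
(ID {{ patient_id }}), age {{ age }}, height {{ height_cm }} cm,
pulse {{ pulse_bpm }} bpm, temperature {{ temperature_c }} C,
and blood pressure {{ bp_systolic }}/{{ bp_diastolic }}.
"""

# Check that every placeholder in the prompt refers to a defined column
existing = {"patient_id", "first_name", "last_name", "bp_systolic", "bp_diastolic"}
known = existing | {s["name"] for s in extra_columns}
unknown = set(re.findall(r"\{\{\s*(\w+)\s*\}\}", note_prompt)) - known

print("specs with missing params:", problems)
print("unknown placeholders:", unknown)

# With a live session (not run here):
# for spec in extra_columns:
#     aidd.add_column(**spec)
# aidd.add_column(name="doctors_note", prompt=note_prompt)
```

Checking the specs and placeholders locally first is a cheap way to catch a misspelled parameter or column name before spending inference requests on a preview.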