<a target="_parent" href="https://colab.research.google.com/github/gretelai/gretel-blueprints/blob/main/docs/notebooks/demo/navigator/data-designer-101/3-seeding-with-a-dataset.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# 🎨 Data Designer 101: Seeding synthetic data generation with an external dataset

In this notebook, we will demonstrate how to seed synthetic data generation in `DataDesigner` with an external dataset.


If this is your first time using `DataDesigner`, we recommend starting with the [first notebook](https://github.com/gretelai/gretel-blueprints/blob/main/docs/notebooks/demo/navigator/data-designer-101/1-the-basics.ipynb) in this 101 series.


<br>

### 💾 Install `gretel-client` and its dependencies

In [None]:
%%capture
%pip install git+https://github.com/gretelai/gretel-python-client datasets

In [None]:
from gretel_client.navigator_client import Gretel

# The Gretel object is the SDK's main entry point for interacting with Gretel's API.
gretel = Gretel(api_key="prompt", endpoint="https://api.dev.gretel.ai")

## 🏥 Download a seed dataset

- For this notebook, we'll change gears and create a synthetic dataset of patient notes.

- To steer the generation process, we will use Gretel's open-source [symptom-to-diagnosis dataset](https://huggingface.co/datasets/gretelai/symptom_to_diagnosis).

In [None]:
from datasets import load_dataset

df_seed = load_dataset("gretelai/symptom_to_diagnosis")["train"].to_pandas()
df_seed = df_seed.rename(columns={"output_text": "diagnosis", "input_text": "patient_summary"})

print(f"Number of records: {len(df_seed)}")

df_seed.head()

## 👩‍⚕️ Designing our synthetic patient notes dataset

- We set the seed dataset using the `with_seed_dataset` method.

- We use the `shuffle` sampling strategy, which shuffles the seed dataset before sampling.

- We set `with_replacement=False`, so each seed record is used at most once and the maximum `num_records` is 853, the number of records in the seed dataset.
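
The effect of these two settings can be sketched in plain Python. This is only an illustration of shuffled sampling with and without replacement, not `DataDesigner`'s implementation: `sample_seed_rows` is a hypothetical helper, the five-row seed list is made up, and raising `ValueError` for an oversized request is an assumption about how such a limit might be enforced.

```python
import random


def sample_seed_rows(seed_rows, num_records, with_replacement=False, rng=None):
    """Hypothetical helper: draw seed rows in shuffled order."""
    rng = rng or random.Random()
    if with_replacement:
        # Rows may repeat, so any num_records is allowed.
        return rng.choices(seed_rows, k=num_records)
    if num_records > len(seed_rows):
        # Assumption: an oversized request is rejected outright.
        raise ValueError(
            f"num_records={num_records} exceeds the {len(seed_rows)} available seed rows"
        )
    # random.sample picks k distinct rows in random order, which is
    # equivalent to shuffling and then taking the first k rows.
    return rng.sample(seed_rows, k=num_records)


# Made-up stand-in for the seed dataset.
seed_rows = [{"diagnosis": f"condition-{i}"} for i in range(5)]
rng = random.Random(0)

subset = sample_seed_rows(seed_rows, 3, rng=rng)
print([row["diagnosis"] for row in subset])

try:
    sample_seed_rows(seed_rows, 6, rng=rng)
except ValueError as err:
    print(f"Rejected: {err}")
```

Without replacement, the seed dataset size is a hard ceiling on the number of records. With replacement, the same seed row can back more than one generated record.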


In [None]:
aidd = gretel.data_designer.new(model_suite="apache-2.0")

aidd.with_seed_dataset(
    df_seed,
    sampling_strategy="shuffle",
    with_replacement=False
)

# An empty dictionary means the person sampler uses its default settings.
aidd.with_person_samplers({"patient_sampler": {}, "doctor_sampler": {}})

In [None]:
# Here we demonstrate how you can add a column by calling `add_column` with the 
# column name, column type, and any parameters for that column type. This is in 
# contrast to using the column and parameter type objects, via `C` and `P`, as we 
# did in the previous notebook. Generally, we recommend using the concrete column
# and parameter type objects, but this is a convenient shorthand when you are 
# familiar with the required arguments for each type.

aidd.add_column(
    name="patient_id",
    type="uuid",
    params={"prefix": "PT-", "short_form": True, "uppercase": True},
)

aidd.add_column(
    name="first_name",
    type="expression",
    expr="{{ patient_sampler.first_name }}",
)

aidd.add_column(
    name="last_name",
    type="expression",
    expr="{{ patient_sampler.last_name }}",
)


aidd.add_column(
    name="dob",
    type="expression",
    expr="{{ patient_sampler.birth_date }}"
)


aidd.add_column(
    name="patient_email",
    type="expression",
    expr="{{ patient_sampler.email_address }}",
)


aidd.add_column(
    name="symptom_onset_date",
    type="datetime",
    params={"start": "2024-01-01", "end": "2024-12-31"},
)

aidd.add_column(
    name="date_of_visit",
    type="timedelta",
    params={
        "dt_min": 1,
        "dt_max": 30,
        "reference_column_name": "symptom_onset_date"
    },
)

aidd.add_column(
    name="physician",
    type="expression",
    expr="Dr. {{ doctor_sampler.last_name }}",
)

# Note that the prompt can reference seed dataset columns (`diagnosis` and
# `patient_summary`) alongside the columns defined above.
aidd.add_column(
    name="physician_notes",
    prompt="""\
You are a primary-care physician who just had an appointment with {{ first_name }} {{ last_name }},
who has been struggling with symptoms from {{ diagnosis }} since {{ symptom_onset_date }}.
The date of today's visit is {{ date_of_visit }}.

{{ patient_summary }}

Write careful notes about your visit with {{ first_name }},
as Dr. {{ doctor_sampler.first_name }} {{ doctor_sampler.last_name }}.

Format the notes as a busy doctor might.
"""
)


aidd.with_evaluation_report().validate()
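
To make the column dependencies above concrete, here is a rough standard-library sketch of how a single record's prompt could be assembled from seed fields and sampled fields. It is a sketch under stated assumptions, not how `DataDesigner` works internally: `render` is a hypothetical stand-in for Jinja rendering that handles only flat `{{ name }}` placeholders (not dotted names such as `doctor_sampler.last_name`), all record values are made up, and `dt_min`/`dt_max` are assumed to be day offsets with both endpoints included.

```python
import random
import re
from datetime import date, timedelta


def render(template, record):
    """Hypothetical stand-in for Jinja: fill flat {{ name }} placeholders."""
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda match: str(record[match.group(1)]),
        template,
    )


rng = random.Random(0)

# Fields that would come from one row of the seed dataset (values made up).
record = {
    "diagnosis": "allergy",
    "patient_summary": "I have been sneezing constantly and my eyes itch.",
}

# Fields that the expression columns would add (values made up).
record["first_name"] = "Jane"
record["last_name"] = "Doe"

# Onset date drawn uniformly from 2024-01-01 through 2024-12-31.
onset = date(2024, 1, 1) + timedelta(days=rng.randrange(366))
record["symptom_onset_date"] = onset

# Assumption: the visit falls 1 to 30 days after the reference column.
record["date_of_visit"] = onset + timedelta(days=rng.randint(1, 30))

template = (
    "{{ first_name }} {{ last_name }} has had symptoms of {{ diagnosis }} "
    "since {{ symptom_onset_date }} and was seen on {{ date_of_visit }}.\n"
    "{{ patient_summary }}"
)
print(render(template, record))
```

The point of the sketch is the ordering: the seed row and the sampled columns must exist before the prompt is rendered, and `date_of_visit` is always derived from `symptom_onset_date` rather than sampled independently.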

## 👀 Preview the dataset

- Iteration is key to generating high-quality synthetic data.

- Use the `preview` method to generate 10 records for inspection.

In [None]:
preview = aidd.preview()

In [None]:
# The preview dataset is available as a pandas DataFrame.
preview.dataset.df.head()

In [None]:
# Run this cell multiple times to cycle through the 10 preview records.
preview.display_sample_record()

## 🆙 Scale up!

- Once you are happy with the preview, scale up to a larger dataset by submitting a batch workflow.

- The output of `create` below includes a workflow link. Click it to follow along with the generation process and to view the evaluation report.

In [None]:
workflow_run = aidd.create(num_records=100, name="aidd-101-notebook-3-patient-notes")