<a target="_parent" href="https://colab.research.google.com/github/gretelai/gretel-blueprints/blob/main/docs/notebooks/data-designer/getting-started/data-designer-with-magic.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# 🎨 Data Designer: Getting Started with Magic


In this guide, we'll walk through how to use the SDK to generate rich, diverse datasets — from designing your columns to injecting variability and logic into your data. We recommend reviewing the [Data Designer 101 Tutorial](https://colab.research.google.com/github/gretelai/gretel-blueprints/blob/main/docs/notebooks/data-designer/data-designer-101/1-the-basics.ipynb) as a starting point to understand the basic concepts of Data Designer first.

*ℹ️ Note: You will need your Gretel API key handy to complete this notebook. You can find it [on the console](https://console.gretel.ai/users/me/key).*

## What is Magic?

Magic is a set of LLM-assistance features for the Data Designer SDK. The goal of Magic is to make creating diverse and high-quality datasets easier, faster, and more enjoyable. Magic features can help you get started on a new Data Designer configuration or refine an existing one.

### Current Magic Features

Refer to the table below for the current experimental features.

| Feature | Description |
| :-----  | :---------- |
| `magic.add_sampling_column` | Generate or update a sampling column configuration based on a text description or edit instruction. |
| `magic.extend_category` | Add `n` new value entries to a Category sampling column to increase diversity. |
| `magic.add_column` |  Generate or update an LLM generation column configuration based on a text description or edit instruction. |
| `magic.refine_prompt` | Vary a prompt template of an LLM generation column while retaining its objective, or edit a prompt template with an instruction. |

## 🧙 Magic vs 🛠 Manual Usage

Since Magic features layer _on top_ of the existing Data Designer SDK, these features can be used interchangeably with manual configurations, allowing you to get help where you need it and stay precise where you don't.


## Installation and Setup

---

### Notebook dependencies

In [1]:
%%capture
%pip install -U datasets git+https://github.com/gretelai/gretel-python-client@main

### Client Configuration

Next, we'll do a one-time login on a client instance, which we can re-use in our examples below. This cell will prompt you for your Gretel API Key, which you can enter into the provided input text area.

In [None]:
from gretel_client.navigator_client import Gretel

gretel = Gretel(api_key="prompt")

# 🌱 Beginner: Kickstart with Magic
---

Let's first dip our toes into Magic by creating a medical patient dataset. Our objective will be to create a dataset consisting of patients, some vital statistics, and some recent chart notes. If you've started with the [Data Designer 101](https://colab.research.google.com/github/gretelai/gretel-blueprints/blob/main/docs/notebooks/data-designer/data-designer-101/1-the-basics.ipynb) tutorial, then you already know that **samplers** are a critical and necessary step to building diverse datasets.

Sometimes, however, we might want to have a little help in getting started -- which column configurations should be used, what tools are available, etc. No sweat! Let's use Magic's `add_sampling_column` to help us get started on this dataset.  


## New Sampling Columns with Magic

In [None]:
## Create a new, blank data designer object
designer = gretel.data_designer.new(model_suite="apache-2.0")

## Use Magic to create a new sampling column.
designer.magic.add_sampling_column("patient", "An Ohioan between the ages of 65 and 85.")

So, what just happened? Calling `designer.magic.add_sampling_column(...)` added a **new** column, `"patient"`, to our dataset configuration! We can see four pieces of information displayed.

1. A confirmation of the column requested (`💬 patient`)
2. The `SamplerColumn` configuration returned, which should indicate both the specified age range `[65, 85]` and state (`'OH'`).
3. A confirmation that the `patient` column was added to the Data Designer object.
4. Finally, the state of `designer`. Since this is the first column we've added, we can see it simply contains `patient`.

If all we wanted to do was to have this single column in our dataset, we could stop here. The `preview` functionality works exactly the same.

In [None]:
preview_output = designer.preview()
preview_output.display_sample_record()

We don't have to stop here, however. We can also edit existing columns using the same function. For example, we can change the age range.

In [None]:
## Use Magic to edit the existing patient column.
## In this mode, simply give it an instruction on what you'd like to see changed.
designer.magic.add_sampling_column(
    "patient",
    "Actually, add in Floridians and widen the age range to 45 to 85."
)

And just like that, we can see that we have included `'FL'` in the configuration and have an adjusted age range. Magic remembered the state of the current column and gave a context-aware update based on the command we just gave it.

However, sometimes we only need help getting started and already know what we want. In that case, we can specify it directly by copy-pasting the `SamplerColumn` output back into `designer`. Then we can make whatever changes we want in place.

In [None]:
## Import the desired column type
from gretel_client.data_designer.columns import SamplerColumn

## Copy pasting the above Magic-returned config and editing the sex field
designer.add_column(
    SamplerColumn(
        name='patient',
        type='person',
        params={
            'locale': 'en_US',
            'sex': 'Female',
            'city': None,
            'age_range': [45, 85],
            'state': ['OH', 'FL'],
        },
        conditional_params={},
        convert_to=None,
    )
)

We can see from the state of `designer` that we still only have a single column (`patient`), but it now has the configuration specified above.

In [None]:
designer.get_column("patient")

But we don't have to stop at `person` samplers; `add_sampling_column(...)` will do its best to infer the most appropriate sampling column type based on your description. Let's add several new columns to our dataset in this way.

In [None]:
designer.magic.add_sampling_column("pid", "a patient id number starting with PATIENT")
designer.magic.add_sampling_column("date_admitted", "Date patient admitted to the hospital, starting from 1995 and ending in 2004.")
designer.magic.add_sampling_column("hospital_status", "The current status of the patient, e.g. if they are IN_ROOM, ER, or RELEASED")
designer.magic.add_sampling_column("heart_rate", "Current heart rate, normal ranges")
designer.magic.add_sampling_column("lymphocytes", "Patient lymphocyte count, should have a min value of 1500")

And just like that, we have many different axes of variation for our patient records to enhance the diversity of our resulting dataset.

In [None]:
preview_output = designer.preview()
preview_output.display_sample_record()

## Extending Categorical Columns with Magic


Sampling from categorical distributions is a common need when designing datasets. Sometimes, you already know every possible value you want to sample from, e.g.

```python
SamplerColumn(
    name="stoplight_color",
    type="category",
    params={"values": ["red", "yellow", "green"]}
)
```
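As a rough mental model (not Gretel's implementation), a `category` sampler draws one value per record from the listed values, optionally weighted. The sketch below uses only the standard library. `sample_category` is a hypothetical helper, and the `weights` key is assumed to behave like the `'weights'` entry that appears in the configs later in this notebook.

```python
import random

def sample_category(params, n, seed=None):
    """Draw n values from a category definition (hypothetical helper)."""
    rng = random.Random(seed)
    values = params["values"]
    weights = params.get("weights")  # None means uniform sampling
    return rng.choices(values, weights=weights, k=n)

# Weighted example: red should appear far more often than yellow.
stoplight = {"values": ["red", "yellow", "green"], "weights": [5, 1, 4]}
draws = sample_category(stoplight, n=1000, seed=0)
print({color: draws.count(color) for color in stoplight["values"]})
```

Leaving `weights` out (or setting it to `None`) falls back to uniform sampling in this sketch.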

However, when using categories to _increase diversity_, it might be hard to know exactly _what_ values to add to a category. Magic has a specific tool for this called `magic.extend_category`, which can be used to add new entries to an existing category definition.

Let's take a look at the `hospital_status` column. What other possible status entries could be added?

In [None]:
designer.add_column(
    SamplerColumn(
        name='hospital_status',
        type='category',
        params={'values': ['IN_ROOM', 'ER', 'RELEASED'], 'weights': None},
        conditional_params={},
        convert_to=None,
    )
)

designer.magic.extend_category("hospital_status", n=3)

We can see that a few new values were added to `hospital_status`. Note how these new values are not only "in distribution" (i.e. being likely values for this status category), but also match formatting (uppercase, here).
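If you extend a category by hand, or merge suggestions from several sources, it can help to guard against duplicates. The sketch below is a plain-Python illustration, not part of the SDK: `merge_category_values` is a hypothetical helper, the suggested values are made up, and whether Magic deduplicates internally is not something this notebook states.

```python
def merge_category_values(existing, suggested):
    """Append suggested values that are not already present (case-insensitive)."""
    seen = {value.casefold() for value in existing}
    merged = list(existing)
    for value in suggested:
        key = value.casefold()
        if key not in seen:
            seen.add(key)
            merged.append(value)
    return merged

current = ["IN_ROOM", "ER", "RELEASED"]
# Made-up suggestions for illustration; real output from Magic will vary.
suggested = ["ICU", "er", "SURGERY", "ICU"]
merged = merge_category_values(current, suggested)
print(merged)
# → ['IN_ROOM', 'ER', 'RELEASED', 'ICU', 'SURGERY']
```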


In [None]:
designer.add_column(
    SamplerColumn(
        name='current_hospital_patient_activity',
        type='category',
        params={'values': ['sleeping', 'doing physical therapy', 'getting bloodwork']},
    )
)

## You can do this several times
designer.magic.extend_category("current_hospital_patient_activity", n=10)
designer.magic.extend_category("current_hospital_patient_activity", n=10)

In [None]:
preview_output = designer.preview()
preview_output.display_sample_record()

# 🪴 Intermediate: Seed Datasets & Prompting

---

Now that we've gotten a feel for Magic with sampling columns, let's see how we can use Magic to help us with LLM generation columns. Magic has two functions available to us for these kinds of tasks, specifically.

1. `magic.add_column(...)` -- for adding or updating an LLM-generated column.
2. `magic.refine_prompt(...)` -- for varying or changing an existing prompt template.

To get started, let's create a new data designer object from a pre-existing dataset on HuggingFace🤗. Let's use the [gretelai/symptom_to_diagnosis](https://huggingface.co/datasets/gretelai/symptom_to_diagnosis) synthetic dataset, which consists of patient descriptions of their symptoms and a diagnosis label.

In [None]:
from datasets import load_dataset

df_seed = (
    load_dataset("gretelai/symptom_to_diagnosis")
    ["train"]
    .to_pandas()
    .rename(
        columns={"output_text": "diagnosis", "input_text": "patient_symptoms"}
    )
)

df_seed

Now, to start from this seed dataset, let's create a new data designer object. Below, we specify `with_seed_dataset`, give it the `df_seed` we loaded above, and then tell it to "shuffle" rows with replacement. This setting will let us generate an arbitrary amount of data conditioned on `df_seed`.

In [None]:
designer = gretel.data_designer.new(model_suite="apache-2.0")

# Create from the seed dataset
designer.with_seed_dataset(
    df_seed,
    sampling_strategy="shuffle",
    with_replacement=True,
)

We can see from the state of `designer` above that it is aware of `diagnosis` and `patient_symptoms` from the seed dataset. Note that these are listed as `seed_columns` -- these columns are _not_ editable.
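To see why sampling with replacement allows an arbitrary number of records, here is a minimal standard-library sketch. It reflects one plausible reading of `sampling_strategy="shuffle"` with `with_replacement=True` (rows drawn at random, repeats allowed); the SDK's exact behavior may differ. `sample_seed_rows` and the two seed rows are hypothetical.

```python
import random

def sample_seed_rows(rows, n, seed=None):
    """Draw n rows at random, with replacement, from a list of seed records."""
    rng = random.Random(seed)
    return [rng.choice(rows) for _ in range(n)]

# Tiny stand-in for df_seed; the values are illustrative only.
seed_rows = [
    {"diagnosis": "allergy", "patient_symptoms": "I keep sneezing."},
    {"diagnosis": "migraine", "patient_symptoms": "My head is pounding."},
]

# With replacement, we can request more records than the seed contains.
sampled = sample_seed_rows(seed_rows, n=5, seed=42)
print(len(sampled))  # → 5
```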

To get started, let's use Magic to add a new LLM-generated column for a doctor's notes about the patient.

In [None]:
designer.magic.add_column(
    "doctors_notes",
    "A bulleted set of notes that a doctor would have written down while listening to the patient."
)

That's a lot of information to take in, so let's zoom in on the generated prompt template. Since all prompt templates are given as Jinja templates, we'll use `rich` to highlight this for us.

In [None]:
import rich
from rich.syntax import Syntax

rich.print(
    Syntax(
        designer.get_column("doctors_notes").prompt,
        "jinja",
        word_wrap=True,
    )
)

We can see from the above Jinja prompt template that `magic.add_column(...)` was able to find and refer to existing columns (`diagnosis` and `patient_symptoms`), and was able to put these into a reasonable prompt template.
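A quick way to check which columns a prompt template refers to is to scan it for `{{ ... }}` placeholders. The sketch below uses a simple regular expression rather than a real Jinja parser, so it only handles plain references. `referenced_columns` is a hypothetical helper, and the template is made up for illustration.

```python
import re

# Matches simple references such as {{ diagnosis }}; real Jinja syntax is richer.
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_]\w*)")

def referenced_columns(template):
    """Return the column names referenced in a template, in order of first use."""
    names = []
    for name in PLACEHOLDER.findall(template):
        if name not in names:
            names.append(name)
    return names

# A made-up template for illustration, not the one Magic generated.
template = (
    "The patient says: {{ patient_symptoms }}\n"
    "Known diagnosis: {{ diagnosis }}\n"
    "Write bulleted notes about {{ patient_symptoms }}."
)
columns = referenced_columns(template)
print(columns)
# → ['patient_symptoms', 'diagnosis']
```

The same helper can be pointed at `designer.get_column("doctors_notes").prompt` to confirm which seed columns a generated or edited template depends on.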



## Refining Prompt Templates with Magic

However, in this prompt template, we see that the `diagnosis` label is given. To write a realistic set of doctor's notes, we want the doctor's perspective to be that of an _investigator_ who doesn't know the actual diagnosis yet!

Let's use `magic.refine_prompt` to make this edit for us.

In [None]:
designer.magic.refine_prompt(
    "doctors_notes",
    "Remove all references to diagnosis. The doctor's notes must be written from the perspective of a doctor attempting to find a diagnosis."
)

rich.print(
    Syntax(
        designer.get_column("doctors_notes").prompt,
        "jinja",
        word_wrap=True,
    )
)

Great! We see that `magic.refine_prompt` was able to edit the starting prompt to be more in line with what we wanted.

We can also use `magic.refine_prompt(...)` to _vary_ an existing prompt without changing its meaning. This is useful for rephrasing prompts to get different outputs.

In [None]:
original_prompt = designer.get_column("doctors_notes").prompt

designer.magic.refine_prompt("doctors_notes")

new_prompt = designer.get_column("doctors_notes").prompt

## Visualize the differences in the original and varied prompt
import difflib

diff_str = "\n".join(
    difflib.ndiff(
        original_prompt.splitlines(),
        new_prompt.splitlines()
    )
)

rich.print(
    Syntax(diff_str, "diff", word_wrap=True)
)
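If the `ndiff` output is unfamiliar: lines prefixed with `- ` exist only in the original, lines prefixed with `+ ` exist only in the new version, and lines prefixed with two spaces are unchanged (`? ` lines are hints marking intra-line changes). The toy prompts below are made up to show how those prefixes can be separated.

```python
import difflib

before = ["Write notes for the patient.", "Use bullets."]
after = ["Write notes for the patient.", "Use short bullets."]

diff = list(difflib.ndiff(before, after))

# Split the diff by its two-character prefix, ignoring "? " hint lines.
unchanged = [line[2:] for line in diff if line.startswith("  ")]
removed = [line[2:] for line in diff if line.startswith("- ")]
added = [line[2:] for line in diff if line.startswith("+ ")]

print(removed, added)
# → ['Use bullets.'] ['Use short bullets.']
```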

Now, let's visualize what this dataset would look like.

In [None]:
preview_output = designer.preview()
preview_output.display_sample_record()

# 🌳 Advanced: Iterating with Magic

---

Sometimes the dataset configuration you're aiming for is complex, and you need to see sample outputs of the data generation process to refine column definitions. Both `magic.add_column(...)` and `magic.add_sampling_column(...)` support the kwargs `interactive=` and `preview=` which can be set to `True` to enter "interactive mode".

Interactive mode is an experimental text-command loop structure that you can use to instruct column configuration settings. This loop accepts the following text commands:

- `accept` -- take the current column configuration settings and stop the interactive session.
- `cancel` -- stop the interactive session and revert to the state prior to the session.
- `start-over` -- revert to the original state prior to the session and try again from the top.
- `retry` -- re-run the last instruction to get a different result.
- `preview` -- immediately run generation and visualize outputs for this column.
- `preview-on` -- turn on auto-preview with each instruction and run preview immediately.
- `preview-off` -- turn off auto-preview (only display config).
- Anything else -- Interpreted as an instruction command.
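The command semantics above can be sketched as a small state machine. This is an illustrative model only, not the SDK's implementation: `run_session` and `fake_llm` are hypothetical, the LLM call is replaced by a deterministic stand-in, and the choices that `retry` re-applies the last instruction to the state before it and that a session ending without `accept` reverts are assumptions.

```python
def run_session(commands, original, apply_instruction):
    """Replay a list of text commands; return (final_config, previews)."""
    current, previous, last_instruction = original, original, None
    auto_preview = False
    previews = []

    def edit(base, instruction):
        # Apply an instruction to `base`; preview if auto-preview is on.
        nonlocal current, previous, last_instruction
        previous, last_instruction = base, instruction
        current = apply_instruction(base, instruction)
        if auto_preview:
            previews.append(current)

    for command in commands:
        if command == "accept":
            return current, previews
        if command == "cancel":
            return original, previews
        if command == "start-over":
            current, previous, last_instruction = original, original, None
        elif command == "retry":
            if last_instruction is not None:
                edit(previous, last_instruction)
        elif command == "preview":
            previews.append(current)
        elif command == "preview-on":
            auto_preview = True
            previews.append(current)
        elif command == "preview-off":
            auto_preview = False
        else:
            edit(current, command)  # anything else is an instruction
    return original, previews  # assumption: no "accept" means revert

def fake_llm(config, instruction):
    # Stand-in for the LLM call: record the instruction in the config.
    return config + (instruction,)

final, previews = run_session(
    ["make it shorter", "preview-on", "use bullets", "retry", "accept"],
    original=(),
    apply_instruction=fake_llm,
)
print(final)
# → ('make it shorter', 'use bullets')
```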

In [None]:
# Set to True if you'd like to test the interactive experience
interactive = False
designer.magic.add_column(
    "likely_diagnoses",
    "A structured list of top-3 possible diagnoses",
    must_depend_on=["doctors_notes"],
    interactive=interactive,
    preview=True
)

# 🌳 Advanced: Refining Jinja Logic in Prompt Templates

---

All prompt templates are written with a reduced subset of Jinja, allowing access to basic filters, logic, and flow-control. You can use `magic.refine_prompt` not just for simple reformatting or variations, but also for helping you get desired, conditional prompt templating using Jinja -- or even structured data referencing. All of these are possible by hand, but `magic.refine_prompt` gives ready access to a quick tool for prototyping different Jinja templating approaches.

Let's consider a case where we have access to a person object, and we want to vary an LLM generation column based on the data found in that column.

In [None]:
from pydantic import BaseModel, Field

class Fruit(BaseModel):
    name: str
    quantity: int
    unit_price_local_currency: float
    currency: str

class FruitstandInventory(BaseModel):
    inventory: list[Fruit] = Field(..., min_length=3, max_length=15)

designer = gretel.data_designer.new(model_suite="apache-2.0")

designer.add_column(
    name="farmer",
    type="person"
)

designer.add_column(
    name="country",
    type="category",
    params={"values": ["USA", "France"]}
)

designer.magic.extend_category("country")

designer.add_column(
    name="fruit_inventory",
    type="llm-structured",
    prompt="Items available at a side-of-the-road fruitstand in {{ country }}.",
    output_format=FruitstandInventory
)

designer.add_column(
    name="farmer_question",
    prompt="Ask the farmer a question about the price of apples."
)

In [None]:
import rich
from rich.syntax import Syntax

designer.magic.refine_prompt("farmer_question", "Update the prompt to ensure that the farmer is addressed as 'Farmer first_name' -- fill in whatever their first name is.")
designer.magic.refine_prompt("farmer_question", "Use the Jinja '| random' filter to select a random fruit from one of the available fruits.", must_depend_on=["fruit_inventory"])


rich.print(
    Syntax(
      designer.get_column("farmer_question").prompt,
      "jinja",
      word_wrap=True
  )
)
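For intuition, Jinja's `random` filter returns one random item from a sequence, much like `random.choice` in Python. The sketch below is a plain-Python analog of what a refined template might do. The exact template Magic generates will differ, all record values are made up, and it assumes the person sampler exposes a `first_name` field.

```python
import random

# Illustrative stand-ins for one generated record; all values are made up.
farmer = {"first_name": "Dana"}
fruit_inventory = {
    "inventory": [
        {"name": "apples", "quantity": 40, "unit_price_local_currency": 0.5, "currency": "USD"},
        {"name": "pears", "quantity": 25, "unit_price_local_currency": 0.75, "currency": "USD"},
        {"name": "cherries", "quantity": 10, "unit_price_local_currency": 3.0, "currency": "USD"},
    ]
}

def render_farmer_question(farmer, fruit_inventory, rng):
    """Plain-Python analog of a template that uses `| random` (a sketch)."""
    fruit = rng.choice(fruit_inventory["inventory"])  # roughly what `| random` does
    return (
        f"Ask Farmer {farmer['first_name']} a question "
        f"about the price of {fruit['name']}."
    )

question = render_farmer_question(farmer, fruit_inventory, random.Random(7))
print(question)
```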

In [None]:
preview_output = designer.preview()
preview_output.display_sample_record()