# Synthetic Dialogue Generation

<p align="right" style="margin-right: 8px;">
    <a target="_blank" href="https://colab.research.google.com/github/idiap/sdialog/blob/main/tutorials/1.single_llm_full_generation.ipynb">
        <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
    </a>
</p>


## Getting started

### Environment Setup

Let's first check if our environment is all set up.

In [28]:
# Set up the environment depending on whether we are running in Google Colab or Jupyter Notebook
from IPython import get_ipython

if "google.colab" in str(get_ipython()):
    print("Running on CoLab")
    # Downloading only the "output" directory from the repository
    !git init .
    !git remote add -f origin https://github.com/Play-Your-Part/tutorials.git
    !git config core.sparseCheckout true
    !echo "output" >> .git/info/sparse-checkout
    !git pull origin main

    # Installing Ollama
    !curl -fsSL https://ollama.com/install.sh | sh

    # Installing sdialog
    !git clone https://github.com/idiap/sdialog.git
    %cd sdialog
    %pip install -e .
    %cd ..
else:
    print("Running in Jupyter Notebook")
    # Little hack to avoid the "OSError: Background processes not supported." error in Jupyter notebooks
    import os
    get_ipython().system = os.system

Running in Jupyter Notebook


> ⚠️ If you're using **Colab**, please, **restart the runtime** once everything above is installed

### Ollama Setup

Let's run the ollama server first, as a background process:

In [29]:
!OLLAMA_KEEP_ALIVE=-1 ollama serve > /dev/null 2>&1 &
!sleep 10  # Let's wait a bit for the server to start

0
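A fixed `sleep 10` can be too short on a slow machine and needlessly long on a fast one. As an alternative, the sketch below polls the server's TCP port until it accepts connections. The helper name `wait_for_port` is hypothetical (it is not part of `sdialog` or Ollama), and `11434` is assumed to be the port Ollama listens on by default; adjust it if your setup differs.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True as soon as a TCP connection to (host, port) succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError while nothing is listening yet
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

For example, `wait_for_port("127.0.0.1", 11434)` could replace the `sleep` call above, under the stated assumption about the port.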

Let's change the default sdialog model to `qwen2.5:14b`:

In [None]:
import sdialog

sdialog.config.llm("qwen2.5:14b")

### Defining the Output (Dialogue)

We will begin by defining the JSON objects that will represent the generated dialogues. For now, each object will have four fields: `"model"`, `"seed"`, `"scenario"`, and `"turns"`, which store the name of the model and the seed used to generate the dialogue, the scenario associated with the dialogue, and the dialogue itself, respectively. More precisely, the `"turns"` field contains the ordered list of turns of the conversation, each with the speaker name and the corresponding utterance, as shown in the following example:

In [30]:
example_dialogue = {
    "model": "qwen2.5:14b",  # the model used to generate the dialogue
    "seed": 123,  # the seed used to generate the dialogue
    "scenario": "short hello and good bye conversation",  # the scenario used to generate the dialogue
    "turns": [
        {"speaker": "Alice", "text": "Hey Bob!"},
        {"speaker": "Bob", "text": "Hey Alice!"},
        {"speaker": "Alice", "text": "Bye Bob!"},
        {"speaker": "Bob", "text": "Bye bye!"},
    ]
}

We can use `pydantic` to properly define our `Dialog` type:

In [31]:
from pydantic import BaseModel
from typing import List, Union, Optional

class Turn(BaseModel):
    speaker: str
    text: str

class Dialog(BaseModel):
    model: str  # the model used to generate the dialogue
    seed: int  # the seed used to generate the dialogue
    scenario: Optional[Union[dict, str]] = None  # the scenario used to generate the dialogue
    turns: List[Turn]  # the list of turns of the conversation

Having a `pydantic` class that formally represents our dialogues is quite useful. For example, we can convert any JSON dialogue to our `Dialog` class as follows:

In [32]:
my_dialogue = Dialog.model_validate(example_dialogue)
my_dialogue

Dialog(model='qwen2.5:14b', seed=123, scenario='short hello and good bye conversation', turns=[Turn(speaker='Alice', text='Hey Bob!'), Turn(speaker='Bob', text='Hey Alice!'), Turn(speaker='Alice', text='Bye Bob!'), Turn(speaker='Bob', text='Bye bye!')])

Or the opposite: convert a `Dialog` to a `dict` or a JSON string as follows:

In [33]:
my_dialogue.model_dump()  # a dict

{'model': 'qwen2.5:14b',
 'seed': 123,
 'scenario': 'short hello and good bye conversation',
 'turns': [{'speaker': 'Alice', 'text': 'Hey Bob!'},
  {'speaker': 'Bob', 'text': 'Hey Alice!'},
  {'speaker': 'Alice', 'text': 'Bye Bob!'},
  {'speaker': 'Bob', 'text': 'Bye bye!'}]}

In [34]:
my_dialogue_json = my_dialogue.model_dump_json(indent=2)  # a string containing the dialog as a JSON object
print(my_dialogue_json)

{
  "model": "qwen2.5:14b",
  "seed": 123,
  "scenario": "short hello and good bye conversation",
  "turns": [
    {
      "speaker": "Alice",
      "text": "Hey Bob!"
    },
    {
      "speaker": "Bob",
      "text": "Hey Alice!"
    },
    {
      "speaker": "Alice",
      "text": "Bye Bob!"
    },
    {
      "speaker": "Bob",
      "text": "Bye bye!"
    }
  ]
}


Or, of course, create a new `Dialog` from scratch:

In [35]:
Dialog(
    model="qwen2.5:14b",
    seed=123,
    turns=[
        Turn(speaker="Alice", text="Hi :)"),
        Turn(speaker="Bob", text="Bye! :(")
    ]
)

Dialog(model='qwen2.5:14b', seed=123, scenario=None, turns=[Turn(speaker='Alice', text='Hi :)'), Turn(speaker='Bob', text='Bye! :(')])
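Because `model_dump()` yields plain Python containers, standard-library tools apply to a dialogue directly. The sketch below works on a hand-written copy of the dictionary shown earlier and counts turns and words per speaker; the variable names are illustrative only.

```python
from collections import Counter

# Same structure as the dict returned by `model_dump()` above
dialogue = {
    "model": "qwen2.5:14b",
    "seed": 123,
    "scenario": "short hello and good bye conversation",
    "turns": [
        {"speaker": "Alice", "text": "Hey Bob!"},
        {"speaker": "Bob", "text": "Hey Alice!"},
        {"speaker": "Alice", "text": "Bye Bob!"},
        {"speaker": "Bob", "text": "Bye bye!"},
    ],
}

# Number of turns taken by each speaker
turns_per_speaker = Counter(turn["speaker"] for turn in dialogue["turns"])

# Number of whitespace-separated words uttered by each speaker
words_per_speaker = Counter()
for turn in dialogue["turns"]:
    words_per_speaker[turn["speaker"]] += len(turn["text"].split())

print(dict(turns_per_speaker))
print(dict(words_per_speaker))
```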

Alternatively, we can use the built-in `Dialog` class from `sdialog`:

In [36]:
from sdialog import Dialog

my_dialog = Dialog.model_validate(example_dialogue)
my_dialog

Dialog(version='0.0.2+2b706d22e2adfdd6e9985e45f97ec24d1c2907ce', timestamp='2025-07-09T13:37:12Z', model='qwen2.5:14b', seed=123, id=None, parentId=None, complete=None, personas=None, scenario='short hello and good bye conversation', turns=[Turn(speaker='Alice', text='Hey Bob!'), Turn(speaker='Bob', text='Hey Alice!'), Turn(speaker='Alice', text='Bye Bob!'), Turn(speaker='Bob', text='Bye bye!')], events=None, notes=None)

Besides providing the same functionality, this class also allows us to, among other things:

- Pretty print the dialogue:

In [37]:
my_dialog.print()

[1m[95m[model] [35mqwen2.5:14b[0m
[1m[95m[seed] [35m123[0m
[1m[35m--- Dialogue Begins ---[0m
[31m[Alice] [0mHey Bob![0m
[94m[Bob] [0mHey Alice![0m
[31m[Alice] [0mBye Bob![0m
[94m[Bob] [0mBye bye![0m
[1m[35m--- Dialogue Ends ---[0m


- Print it in a vanilla textual form:

In [38]:
print(my_dialog)

Alice: Hey Bob!
Bob: Hey Alice!
Alice: Bye Bob!
Bob: Bye bye!


- Save it to a file:

In [39]:
# either as a JSON object
my_dialog.to_file("output/my_dialogue.json")

# or a txt file
my_dialog.to_file("output/my_dialogue.txt")

_(check created files [`output/my_dialogue.json`](output/my_dialogue.json) and [`output/my_dialogue.txt`](output/my_dialogue.txt))_

- Load a dialogue from disk

In [40]:
my_dialog = Dialog.from_file("output/my_dialogue.json")
my_dialog.print(scenario=True)  # `scenario=True` to also print the metadata stored in scenario field

[1m[95m[model] [35mqwen2.5:14b[0m
[1m[95m[seed] [35m123[0m
[1m[95m[scenario] [35m[0m
[35mshort hello and good bye conversation[0m
[1m[35m--- Dialogue Begins ---[0m
[31m[Alice] [0mHey Bob![0m
[94m[Bob] [0mHey Alice![0m
[31m[Alice] [0mBye Bob![0m
[94m[Bob] [0mBye bye![0m
[1m[35m--- Dialogue Ends ---[0m


In [41]:
my_dialog = Dialog.from_file("output/my_dialogue.txt")
my_dialog.print(scenario=True)

[1m[35m--- Dialogue Begins ---[0m
[31m[Alice] [0mHey Bob![0m
[94m[Bob] [0mHey Alice![0m
[31m[Alice] [0mBye Bob![0m
[94m[Bob] [0mBye bye![0m
[1m[35m--- Dialogue Ends ---[0m


- Or do simple things, such as quickly checking how many turns a dialogue has:

In [42]:
len(my_dialog)

4

Now we are ready to begin working on synthetic `Dialog` (;)) generation!

## Dialogue Generation

### Description-driven Generation

We can use `sdialog`'s built-in `DialogGenerator` class, which is instantiated from a natural language description of the desired dialogue:

In [46]:
from sdialog.generators import DialogGenerator

It takes the following arguments as input:
- `model`: the name of the model to use (any model tag from the [Ollama library](https://ollama.com/library)).
- `dialogue_details`: the details of the desired dialogue.
- `output_format`: the output format, as a `pydantic` class or JSON schema like the one we defined above (`LLMDialogOutput` by default).
- `scenario`: an optional metadata field that describes the scenario used to generate the dialogue.

For instance, let's create an instance of `DialogGenerator` to generate conversations between Bob and Alice about her birthday:

In [59]:
dialog_generator = DialogGenerator(
    dialogue_details="The conversation is between a dad (Bob) and his doughter (Alice). "
                     "Her birthday is coming up and she wants to throw a Star Wars themed party.",
)

Then, we can use the built-in `.generate()` method to generate conversations with this instance:

_(each time you run the code below, a different dialogue will be generated)_

In [60]:
dialog = dialog_generator.generate()
dialog.print(scenario=True)

[2025-07-09 15:48:06] INFO:sdialog.generators:Prompt used: [{'type': 'system', 'data': {'content': 'You are a highly capable AI assistant skilled at writing natural, multi-turn conversations by role-playing different speakers. Generate a complete, realistic dialogue from start (greetings) to finish (goodbye), ensuring each speaker remains consistent and the flow is natural and engaging.\n\n', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'system', 'name': None, 'id': None}}, {'type': 'human', 'data': {'content': 'The conversation is between a dad (Bob) and his doughter (Alice). Her birthday is coming up and she wants to throw a Star Wars themed party.', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'human', 'name': None, 'id': None, 'example': False}}]


[1m[95m[dialog_id] [35m1752068886722[0m
[1m[95m[model] [35mmodel='qwen2.5:14b' temperature=0.8 seed=13 format={'$defs': {'Turn': {'description': 'Represents a single turn in a dialogue.\n\n:ivar speaker: The name or role of the speaker.\n:vartype speaker: Optional[str]\n:ivar text: The utterance text for this turn.\n:vartype text: str', 'properties': {'speaker': {'anyOf': [{'type': 'string'}, {'type': 'null'}], 'title': 'Speaker'}, 'text': {'title': 'Text', 'type': 'string'}}, 'required': ['speaker', 'text'], 'title': 'Turn', 'type': 'object'}}, 'description': 'Pydantic model for LLM-generated dialogue output.\n\n:ivar dialog: List of dialogue turns.\n:vartype dialog: List[Turn]', 'properties': {'dialog': {'items': {'$ref': '#/$defs/Turn'}, 'title': 'Dialog', 'type': 'array'}}, 'required': ['dialog'], 'title': 'LLMDialogOutput', 'type': 'object'}[0m
[1m[95m[seed] [35m2071202074[0m
[1m[95m[scenario] [35m[0m
[35mThe conversation is between a dad (Bob) and his doughter (A

Let's now change the description: the party should be themed after Lord of the Rings instead of Star Wars. To do this, we can simply pass the new description when calling the `generate()` method:

In [61]:
dialog = dialog_generator.generate(
    "The conversation is between a dad (Bob) and his doughter (Alice). "
    "Her birthday is coming up and she wants to throw a Lord of the Rings themed party."
)
dialog.print(scenario=True)

[2025-07-09 15:48:20] INFO:sdialog.generators:Prompt used: [{'type': 'system', 'data': {'content': 'You are a highly capable AI assistant skilled at writing natural, multi-turn conversations by role-playing different speakers. Generate a complete, realistic dialogue from start (greetings) to finish (goodbye), ensuring each speaker remains consistent and the flow is natural and engaging.\n\n', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'system', 'name': None, 'id': None}}, {'type': 'human', 'data': {'content': 'The conversation is between a dad (Bob) and his doughter (Alice). Her birthday is coming up and she wants to throw a Lord of the Rings themed party.', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'human', 'name': None, 'id': None, 'example': False}}]


[1m[95m[dialog_id] [35m1752068900414[0m
[1m[95m[model] [35mmodel='qwen2.5:14b' temperature=0.8 seed=13 format={'$defs': {'Turn': {'description': 'Represents a single turn in a dialogue.\n\n:ivar speaker: The name or role of the speaker.\n:vartype speaker: Optional[str]\n:ivar text: The utterance text for this turn.\n:vartype text: str', 'properties': {'speaker': {'anyOf': [{'type': 'string'}, {'type': 'null'}], 'title': 'Speaker'}, 'text': {'title': 'Text', 'type': 'string'}}, 'required': ['speaker', 'text'], 'title': 'Turn', 'type': 'object'}}, 'description': 'Pydantic model for LLM-generated dialogue output.\n\n:ivar dialog: List of dialogue turns.\n:vartype dialog: List[Turn]', 'properties': {'dialog': {'items': {'$ref': '#/$defs/Turn'}, 'title': 'Dialog', 'type': 'array'}}, 'required': ['dialog'], 'title': 'LLMDialogOutput', 'type': 'object'}[0m
[1m[95m[seed] [35m2167019297[0m
[1m[95m[scenario] [35m[0m
[35mThe conversation is between a dad (Bob) and his doughter (A

To re-generate the exact same dialogue each time, pass the `seed` number shown above as an argument of `generate()`, as follows:

In [64]:
dialog_generator.generate(
        "The conversation is between a dad (Bob) and his doughter (Alice). "
        "Her birthday is coming up and she wants to throw a Lord of the Rings themed party.",
        seed=2167019297
).print()

[2025-07-09 15:50:03] INFO:sdialog.generators:Prompt used: [{'type': 'system', 'data': {'content': 'You are a highly capable AI assistant skilled at writing natural, multi-turn conversations by role-playing different speakers. Generate a complete, realistic dialogue from start (greetings) to finish (goodbye), ensuring each speaker remains consistent and the flow is natural and engaging.\n\n', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'system', 'name': None, 'id': None}}, {'type': 'human', 'data': {'content': 'The conversation is between a dad (Bob) and his doughter (Alice). Her birthday is coming up and she wants to throw a Lord of the Rings themed party.', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'human', 'name': None, 'id': None, 'example': False}}]


[1m[95m[dialog_id] [35m1752069003606[0m
[1m[95m[model] [35mmodel='qwen2.5:14b' temperature=0.8 seed=13 format={'$defs': {'Turn': {'description': 'Represents a single turn in a dialogue.\n\n:ivar speaker: The name or role of the speaker.\n:vartype speaker: Optional[str]\n:ivar text: The utterance text for this turn.\n:vartype text: str', 'properties': {'speaker': {'anyOf': [{'type': 'string'}, {'type': 'null'}], 'title': 'Speaker'}, 'text': {'title': 'Text', 'type': 'string'}}, 'required': ['speaker', 'text'], 'title': 'Turn', 'type': 'object'}}, 'description': 'Pydantic model for LLM-generated dialogue output.\n\n:ivar dialog: List of dialogue turns.\n:vartype dialog: List[Turn]', 'properties': {'dialog': {'items': {'$ref': '#/$defs/Turn'}, 'title': 'Dialog', 'type': 'array'}}, 'required': ['dialog'], 'title': 'LLMDialogOutput', 'type': 'object'}[0m
[1m[95m[seed] [35m2167019297[0m
[1m[35m--- Dialogue Begins ---[0m
[94m[Bob] [0mHey Alice, what's on your mind today?[0m


### Role-Playing-based Generation

Our goal here is to have the LLM generate the dialogues by role-playing the different characters.

Each character will be fully defined by its persona. So, the same way we started this tutorial by defining what a "synthetic `Dialog`" actually is, we should now define our `Persona`.

Fortunately, `sdialog` provides a `BasePersona` class that we can import to create our own custom persona classes. Let's import it:

In [65]:
from sdialog.personas import BasePersona

And let's define our concrete `Persona` class now by specifying some useful attributes like a name, role, background, etc:

In [66]:
class Persona(BasePersona):
    name: str = ""
    role: str = ""
    background: str = ""
    personality: str = ""
    circumstances: str = ""
    rules: str = ""

Now we can create/instantiate any `Persona` we want for our characters. Let's create our Bob and Alice:

In [67]:
bob_persona = Persona(
        name="Bob",
        role="great dad",
        circumstances="Your daughter will talk to you",
        background="Computer Science PhD.",
        personality="an extremely happy person that likes to help people",
)

alice_persona = Persona(
    name="Alice",
    role="lovely daughter",
    circumstances="Your birthday is getting closer and you are talking with your dad to organize the party. "
                  "You want your party to be themed as Lord of The Rings."
)

If we print `bob_persona`, we will see that it is automatically converted to a natural language description, which is useful when we want to create the actual prompt for our LLM.

In [68]:
print(bob_persona)

* Name: Bob
* Role: great dad
* Background: Computer Science PhD.
* Personality: an extremely happy person that likes to help people
* Circumstances: Your daughter will talk to you
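To make the conversion concrete, the sketch below mimics the behavior visible in the output above: one bullet per attribute, in declaration order, with empty attributes skipped. It uses a standard-library dataclass; `PersonaSketch` is a hypothetical stand-in, and `BasePersona` may well be implemented differently.

```python
from dataclasses import dataclass, fields


@dataclass
class PersonaSketch:
    name: str = ""
    role: str = ""
    background: str = ""
    personality: str = ""
    circumstances: str = ""
    rules: str = ""

    def description(self) -> str:
        """Render every non-empty attribute as a '* Key: value' bullet."""
        lines = []
        for field in fields(self):
            value = getattr(self, field.name)
            if value:  # attributes left empty are omitted
                lines.append(f"* {field.name.capitalize()}: {value}")
        return "\n".join(lines)

    def __str__(self) -> str:
        return self.description()


bob = PersonaSketch(
    name="Bob",
    role="great dad",
    circumstances="Your daughter will talk to you",
    background="Computer Science PhD.",
    personality="an extremely happy person that likes to help people",
)
print(bob)
```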


If we are working with a really complex persona and this default description does not suit our needs, we can override it by defining our own `description()` method, as in the following example:

In [69]:
class PersonaCustom(BasePersona):
    name: str = ""
    role: str = ""

    def description(self):
        return f"Your awesome name is {self.name} and your awesome role is being a {self.role}"

awesome_bob = PersonaCustom(
        name="Bob",
        role="great dad"
)

# Let's print "awesome_bob" persona
print(awesome_bob)

Your awesome name is Bob and your awesome role is being a great dad


So, now that we know how to create personas, let's move on to the fun part: creating the actual generator.

Fortunately, we can simply use `sdialog`'s built-in `PersonaDialogGenerator` class to generate our persona-based dialogues as follows:

In [71]:
from sdialog.generators import PersonaDialogGenerator

dialog_generator = PersonaDialogGenerator(
    persona_a=bob_persona,
    persona_b=alice_persona,
)

dialog_generator.generate().print()

[2025-07-09 15:50:46] INFO:sdialog.generators:Prompt used: [{'type': 'system', 'data': {'content': 'You are a highly capable AI assistant skilled at writing natural, multi-turn conversations by role-playing different speakers. Generate a complete, realistic dialogue from start (greetings) to finish (goodbye), ensuring each speaker remains consistent and the flow is natural and engaging.\n\n', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'system', 'name': None, 'id': None}}, {'type': 'human', 'data': {'content': '# Improved two-persona dialogue prompt\nRole-play a conversation between the following two characters. Each character is defined by their persona below in JSON format. Remain in character for both roles at all times.\n\n[[ ## BEGIN FIRST PERSONA ## ]]\n{\n  "name": "Bob",\n  "role": "great dad",\n  "background": "Computer Science PhD.",\n  "personality": "an extremely happy person that likes to help people",\n  "circumstances": "Your daughter will talk to you"\n}\n

[1m[95m[dialog_id] [35m1752069046672[0m
[1m[95m[model] [35mmodel='qwen2.5:14b' temperature=0.8 seed=13 format={'$defs': {'Turn': {'description': 'Represents a single turn in a dialogue.\n\n:ivar speaker: The name or role of the speaker.\n:vartype speaker: Optional[str]\n:ivar text: The utterance text for this turn.\n:vartype text: str', 'properties': {'speaker': {'anyOf': [{'type': 'string'}, {'type': 'null'}], 'title': 'Speaker'}, 'text': {'title': 'Text', 'type': 'string'}}, 'required': ['speaker', 'text'], 'title': 'Turn', 'type': 'object'}}, 'description': 'Pydantic model for LLM-generated dialogue output.\n\n:ivar dialog: List of dialogue turns.\n:vartype dialog: List[Turn]', 'properties': {'dialog': {'items': {'$ref': '#/$defs/Turn'}, 'title': 'Dialog', 'type': 'array'}}, 'required': ['dialog'], 'title': 'LLMDialogOutput', 'type': 'object'}[0m
[1m[95m[seed] [35m4124850050[0m
[1m[35m--- Dialogue Begins ---[0m
[94m[Bob] [0mHi Alice![0m
[31m[Alice] [0mHello Dad! 

## Use Case: Dialogue Generation for STAR Dataset

### Introduction

Before we begin this section, make sure the STAR dataset is downloaded on your system, inside the `datasets` folder:

In [None]:
# Let's clone the STAR dataset repository
!git clone https://github.com/RasaHQ/STAR.git datasets/STAR

# Let's check that `dialogues` and `tasks` folders are inside `datasets/STAR`
!ls datasets/STAR

The [STAR](https://arxiv.org/pdf/2010.11853) dataset contains 6652 human-generated dialogues as JSON objects where files are named as `NUMBER.json`.

Humans had to follow a well-defined set of instructions to generate each dialogue, role-playing the system (wizard) and the client (user).

For instance, clicking [here](datasets/STAR/dialogues/1.json) opens the file [`1.json`](datasets/STAR/dialogues/1.json), which contains the first dialogue. For now, let's focus only on the `"Scenario"` field.

For the dialogue in `1.json`, it is as follows:

```json
{
    "Domains": [  # List of domains
        "doctor"
    ],
    "Happy": true,  # Whether or not the dialogue follows a happy path
    "MultiTask": false,  # Whether or not this dialogue involves more than one task
    "UserTask": "You (Alexis) had an appointment with Dr. Morgan the other day. Unfortunately, you forgot to write down the instructions the doctor gave you. Please followup and find out how often to take your medicine.",
    "WizardTask": "Inform the user of his/her doctor's orders.",
    "WizardCapabilities": [  # List of flowcharts describing each task the Wizard is capable of doing
        {
            "Domain": "doctor",
            "SchemaImage": "doctor_followup.jpg",
            "Task": "doctor_followup"
        }
    ]
}
```

We can use the `STAR` module from `sdialog` to read the scenario object of any STAR dialogue given its ID, as follows:

In [73]:
from sdialog.datasets import STAR

# Let's first indicate where the dataset is located
STAR.set_path("datasets/STAR/")

# Let's set the first dialogue as the target example
TARGET_DIALOG = 1

# Let's load the scenario of the first dialog
scenario = STAR.get_dialog_scenario(TARGET_DIALOG)
scenario

{'Domains': ['doctor'],
 'Happy': True,
 'MultiTask': False,
 'UserTask': 'You (Alexis) had an appointment with Dr. Morgan the other day. Unfortunately, you forgot to write down the instructions the doctor gave you. Please followup and find out how often to take your medicine.',
 'WizardCapabilities': [{'Domain': 'doctor',
   'SchemaImage': 'doctor_followup.jpg',
   'Task': 'doctor_followup'}],
 'WizardTask': "Inform the user of his/her doctor's orders."}

Which corresponds to the following dialogue:

In [74]:
original_dialogue = STAR.get_dialog(1)
original_dialogue.print()

[1m[95m[dialog_id] [35m1[0m
[1m[35m--- Dialogue Begins ---[0m
[94m[User] [0mHello, I'm really worried. I forgot what I'm supposed to do and forgot to write it down... What do I do?[0m
[31m[System] [0mCould I get your name, please?[0m
[94m[User] [0mMy name is Alexis and my last doctor was Dr. Morgan, but now my doctor is Dr. Johnson and I forgot how to take my medicine.[0m
[31m[System] [0mYour instructions are: Take your medicine before you go to sleep. If you experience nausea, please contact your doctor immediately..[0m
[94m[User] [0mAre you sure I'm supposed to take it before bed? I don't go to sleep every day because my sleep schedule is totally off right now because of the Coronavirus.[0m
[31m[System] [0mYes. It must be before bed or it will not be effective.[0m
[94m[User] [0mOkay thank you. I will get back in touch if this doesn't help.[0m
[31m[System] [0mThank you and goodbye.[0m
[1m[35m--- Dialogue Ends ---[0m


### Description-driven Generation

In the `scenario`, we can see that the user's behavior is defined by instructions given in natural language (`"UserTask"`), whereas the system/wizard behavior is more rigidly defined as a graph describing the dialogue policy to be followed (since the system was expected to be more deterministic). These graphs are described as JSON objects storing the graph edges as key:value pairs (source:destination). We can find these graphs in the [`STAR/tasks`](datasets/STAR/tasks) folder.
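To make the source:destination representation concrete, the sketch below walks such a mapping from its start node and also renders it in DOT, a plain-text graph notation. The edges are copied from the `doctor_followup` flowchart printed later in this tutorial. Both helpers are hypothetical and assume a single outgoing edge per node and node names that are valid DOT identifiers; the actual task files may carry more structure.

```python
from typing import Dict, List

# source -> destination edges of the `doctor_followup` task
edges: Dict[str, str] = {
    "hello": "ask_name",
    "ask_name": "doctor_ask_doctor_name",
    "doctor_ask_doctor_name": "query",
    "query": "doctor_inform_doctors_instructions",
    "doctor_inform_doctors_instructions": "anything_else",
}


def follow_policy(edges: Dict[str, str], start: str) -> List[str]:
    """Return the sequence of nodes visited when following edges from `start`."""
    path = [start]
    seen = {start}
    while path[-1] in edges:
        nxt = edges[path[-1]]
        if nxt in seen:  # stop instead of looping forever on a cycle
            break
        path.append(nxt)
        seen.add(nxt)
    return path


def edges_to_dot(name: str, edges: Dict[str, str]) -> str:
    """Render the edge mapping as a DOT `digraph` description."""
    body = ";\n".join(f"    {src} -> {dst}" for src, dst in edges.items())
    return f"digraph {name} {{\n{body}\n}}"


print(follow_policy(edges, "hello"))
print(edges_to_dot("doctor_followup", edges))
```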

Ideally, we would like our `DialogGenerator` to generate dialogues for each different scenario. That is, given a `scenario`, we would like to generate multiple dialogues for it.

To achieve this, we only need to find a way to describe each `scenario` using natural language so that we can pass it to our `DialogGenerator`.

Fortunately, we can use the built-in `get_scenario_description()` method to do this. It takes a `scenario` as input and returns its natural language description containing all the details (including the system behavior described by the graphs):

In [75]:
print(STAR.get_scenario_description(scenario))

The conversation is between a User and a AI assistant in the following domains: doctor.

The User instructions are: You (Alexis) had an appointment with Dr. Morgan the other day. Unfortunately, you forgot to write down the instructions the doctor gave you. Please followup and find out how often to take your medicine.
The AI assistant instructions are: Inform the user of his/her doctor's orders.

In addition, the AI assistant is instructed to follow specific flowcharts to address the tasks. Flowcharts are defined as graph described using DOT.
The actual DOT for the current tasks are:

The graph for the task 'doctor_followup' with domain 'doctor' is:
```dot
digraph doctor_followup  {
    hello -> ask_name;
    ask_name -> doctor_ask_doctor_name;
    doctor_ask_doctor_name -> query;
    query -> doctor_inform_doctors_instructions;
    doctor_inform_doctors_instructions -> anything_else
}
```
and one example responses for each node is provided in the following json:
```json
{
  "hello": "H

Note how the original JSON graph describing the system's behavior has been converted to a [DOT](https://en.wikipedia.org/wiki/DOT_(graph_description_language)) description, which should be easier for the LLM to interpret, since DOT is a well-known format for describing graphs in plain text.

Let's now put these two methods together and create a function that, given a STAR dialogue ID, returns the associated scenario along with its natural language description:

In [76]:
def get_dialog_scenario_description(dialogue_id):
    # Get the scenario of the target dialogue
    scenario = STAR.get_dialog_scenario(dialogue_id)
    # Then return it along its description in natural language
    return scenario, STAR.get_scenario_description(scenario)

With this function, we now have everything we need to generate synthetic dialogues for STAR that follow the same scenario as a given real target dialogue.

For instance, let's say we want to generate dialogues following the same scenario as the first STAR dialogue, we can simply:

In [77]:
# First, let's get the scenario and description of the first dialogue
scenario, description = get_dialog_scenario_description(dialogue_id=1)

# Let's now create a dialogue generator for it
dialog_generator = DialogGenerator(
    dialogue_details=description,
    scenario=scenario
)

Let's now generate multiple conversations that follow the same scenario as dialogue 1 of the STAR dataset.
> **Note**
> Run the cell multiple times to get different conversations for it

In [78]:
dialog = dialog_generator.generate()
dialog.print(scenario=True)

[2025-07-09 15:51:33] INFO:sdialog.generators:Prompt used: [{'type': 'system', 'data': {'content': 'You are a highly capable AI assistant skilled at writing natural, multi-turn conversations by role-playing different speakers. Generate a complete, realistic dialogue from start (greetings) to finish (goodbye), ensuring each speaker remains consistent and the flow is natural and engaging.\n\n', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'system', 'name': None, 'id': None}}, {'type': 'human', 'data': {'content': 'The conversation is between a User and a AI assistant in the following domains: doctor.\n\nThe User instructions are: You (Alexis) had an appointment with Dr. Morgan the other day. Unfortunately, you forgot to write down the instructions the doctor gave you. Please followup and find out how often to take your medicine.\nThe AI assistant instructions are: Inform the user of his/her doctor\'s orders.\n\nIn addition, the AI assistant is instructed to follow specific f

[1m[95m[dialog_id] [35m1752069093284[0m
[1m[95m[model] [35mmodel='qwen2.5:14b' temperature=0.8 seed=13 format={'$defs': {'Turn': {'description': 'Represents a single turn in a dialogue.\n\n:ivar speaker: The name or role of the speaker.\n:vartype speaker: Optional[str]\n:ivar text: The utterance text for this turn.\n:vartype text: str', 'properties': {'speaker': {'anyOf': [{'type': 'string'}, {'type': 'null'}], 'title': 'Speaker'}, 'text': {'title': 'Text', 'type': 'string'}}, 'required': ['speaker', 'text'], 'title': 'Turn', 'type': 'object'}}, 'description': 'Pydantic model for LLM-generated dialogue output.\n\n:ivar dialog: List of dialogue turns.\n:vartype dialog: List[Turn]', 'properties': {'dialog': {'items': {'$ref': '#/$defs/Turn'}, 'title': 'Dialog', 'type': 'array'}}, 'required': ['dialog'], 'title': 'LLMDialogOutput', 'type': 'object'}[0m
[1m[95m[seed] [35m2449260168[0m
[1m[95m[scenario] [35m[0m
[35m{
  "Domains": [
    "doctor"
  ],
  "Happy": true,
  "Mult

We can see that the LLM is able to follow the scenario surprisingly well, especially for the system part, which is guided by a graph with pre-defined responses.

Now let's update the generator to match a more challenging scenario, for example that of dialogue 5100, which is multi-task and does not follow a happy path:

In [79]:
scenario, description = get_dialog_scenario_description(5100)

dialog_generator = DialogGenerator(
    dialogue_details=description,
    scenario=scenario
)
scenario

{'Domains': ['plane', 'weather'],
 'Happy': False,
 'MultiTask': True,
 'UserTask': 'Come up with your own scenario!\n\nAbout you:\n- Your name: Ben\n\n The AI Assistant can handle:\n- Search for a flight (e.g. from Chicago to Pittsburgh)\n- Book a flight (e.g. with id 193)\n- Checking the weather forecast in different Cities (e.g. Chicago or Pittsburgh)',
 'WizardCapabilities': [{'Domain': 'plane',
   'SchemaImage': 'plane_search.jpg',
   'Task': 'plane_search'},
  {'Domain': 'plane', 'SchemaImage': 'plane_book.jpg', 'Task': 'plane_book'},
  {'Domain': 'weather', 'SchemaImage': 'weather.jpg', 'Task': 'weather'}],
 'WizardTask': 'Follow the flow charts and help the user.'}

And generate dialogues for it:

_(run the call multiple times to generate different ones)_

In [80]:
dialog_generator.generate().print()

[2025-07-09 15:54:04] INFO:sdialog.generators:Prompt used: [{'type': 'system', 'data': {'content': 'You are a highly capable AI assistant skilled at writing natural, multi-turn conversations by role-playing different speakers. Generate a complete, realistic dialogue from start (greetings) to finish (goodbye), ensuring each speaker remains consistent and the flow is natural and engaging.\n\n', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'system', 'name': None, 'id': None}}, {'type': 'human', 'data': {'content': 'The conversation is between a User and a AI assistant in the following domains: plane, weather.\n\nThe User instructions are: Come up with your own scenario!\n\nAbout you:\n- Your name: Ben\n\n The AI Assistant can handle:\n- Search for a flight (e.g. from Chicago to Pittsburgh)\n- Book a flight (e.g. with id 193)\n- Checking the weather forecast in different Cities (e.g. Chicago or Pittsburgh)\nThe AI assistant instructions are: Follow the flow charts and help the

[1m[95m[dialog_id] [35m1752069244977[0m
[1m[95m[model] [35mmodel='qwen2.5:14b' temperature=0.8 seed=13 format={'$defs': {'Turn': {'description': 'Represents a single turn in a dialogue.\n\n:ivar speaker: The name or role of the speaker.\n:vartype speaker: Optional[str]\n:ivar text: The utterance text for this turn.\n:vartype text: str', 'properties': {'speaker': {'anyOf': [{'type': 'string'}, {'type': 'null'}], 'title': 'Speaker'}, 'text': {'title': 'Text', 'type': 'string'}}, 'required': ['speaker', 'text'], 'title': 'Turn', 'type': 'object'}}, 'description': 'Pydantic model for LLM-generated dialogue output.\n\n:ivar dialog: List of dialogue turns.\n:vartype dialog: List[Turn]', 'properties': {'dialog': {'items': {'$ref': '#/$defs/Turn'}, 'title': 'Dialog', 'type': 'array'}}, 'required': ['dialog'], 'title': 'LLMDialogOutput', 'type': 'object'}[0m
[1m[95m[seed] [35m33572775[0m
[1m[35m--- Dialogue Begins ---[0m
[31m[AI Assistant] [0mHello, how can I help?[0m
[94m[Be

### Role-playing-based Generation

In the previous section, we only had to find a way to describe each `scenario` in natural language so that we could pass it to our `DialogGenerator`.

Likewise, we now have to find a way to build the right system and user personas for each scenario. In other words, given a `scenario`, we have to return the matching system and user `Persona` objects.

Fortunately, we can use the built-in `STAR.get_user_persona_for_scenario(scenario)` and `STAR.get_system_persona_for_scenario(scenario)` methods to achieve this.

For instance, let's get the user persona for the `scenario` of the first dialogue above:

In [81]:
scenario = STAR.get_dialog_scenario(TARGET_DIALOG)

user_persona = STAR.get_user_persona_for_scenario(scenario)
print(user_persona)

* Language: English
* Role: user calling a AI assistant that can perform multiple tasks in the following domains: doctor.

The following should be considered regarding the conversation:
   1. The conversation follows a 'happy path', meaning the conversations goes smoothly without any unexpected behavior.
   2. The conversation involves only one task you were instructed to (doctor_followup), nothing else
* Circumstances: You (Alexis) had an appointment with Dr. Morgan the other day. Unfortunately, you forgot to write down the instructions the doctor gave you. Please followup and find out how often to take your medicine.


Now, as in the previous subsection, we just need to define a function that, given a dialogue ID, returns its scenario together with the corresponding system and user personas:

In [82]:
def get_dialog_scenario_and_personas(dialogue_id):
    # Get the scenario of the target dialogue
    scenario = STAR.get_dialog_scenario(dialogue_id)
    # Get the personas
    system = STAR.get_system_persona_for_scenario(scenario)
    user = STAR.get_user_persona_for_scenario(scenario)
    return scenario, system, user

And that's it, now we can simply create a `PersonaDialogGenerator` using the system and user personas as follows:

In [86]:
# let's get the personas and the scenario
scenario, system, user = get_dialog_scenario_and_personas(dialogue_id=1)

# let's now create a dialogue generator for it
dialog_generator = PersonaDialogGenerator(
    persona_a=system,
    persona_b=user,
    scenario=scenario
)

# let's generate the dialogue
dialog_generator.generate().print()

[2025-07-09 15:55:10] INFO:sdialog.generators:Prompt used: [{'type': 'system', 'data': {'content': 'You are a highly capable AI assistant skilled at writing natural, multi-turn conversations by role-playing different speakers. Generate a complete, realistic dialogue from start (greetings) to finish (goodbye), ensuring each speaker remains consistent and the flow is natural and engaging.\n\n', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'system', 'name': None, 'id': None}}, {'type': 'human', 'data': {'content': '# Improved two-persona dialogue prompt\nRole-play a conversation between the following two characters. Each character is defined by their persona below in JSON format. Remain in character for both roles at all times.\n\n[[ ## BEGIN FIRST PERSONA ## ]]\n{\n  "language": "English",\n  "role": "AI assistant.\\nIn the conversation, the AI assistant is instructed to follow specific action flowcharts to address the tasks. Flowcharts are defined as graph described using D

[1m[95m[dialog_id] [35m1752069310746[0m
[1m[95m[model] [35mmodel='qwen2.5:14b' temperature=0.8 seed=13 format={'$defs': {'Turn': {'description': 'Represents a single turn in a dialogue.\n\n:ivar speaker: The name or role of the speaker.\n:vartype speaker: Optional[str]\n:ivar text: The utterance text for this turn.\n:vartype text: str', 'properties': {'speaker': {'anyOf': [{'type': 'string'}, {'type': 'null'}], 'title': 'Speaker'}, 'text': {'title': 'Text', 'type': 'string'}}, 'required': ['speaker', 'text'], 'title': 'Turn', 'type': 'object'}}, 'description': 'Pydantic model for LLM-generated dialogue output.\n\n:ivar dialog: List of dialogue turns.\n:vartype dialog: List[Turn]', 'properties': {'dialog': {'items': {'$ref': '#/$defs/Turn'}, 'title': 'Dialog', 'type': 'array'}}, 'required': ['dialog'], 'title': 'LLMDialogOutput', 'type': 'object'}[0m
[1m[95m[seed] [35m436821777[0m
[1m[35m--- Dialogue Begins ---[0m
[31m[AI assistant] [0mHello, how can I help?[0m
[94m[u

And that's it for the generation part of this tutorial, congrats! 😎

### Saving our dialogues

Finally, let's generate one synthetic dialog for each happy `"doctor_followup"` dialog in STAR and save it to disk for later use.

Let's first get all happy dialogues for this task using `sdialog`'s built-in `STAR.get_dialogs()` function:

In [85]:
original_dialogs = STAR.get_dialogs(task_name="doctor_followup", happy=True, multitask=False)
print('Total number of happy "doctor_followup" dialogues in STAR:', len(original_dialogs))

Total number of happy "doctor_followup" dialogues in STAR: 105


Now let's generate the dialogues and save them under the path given by the `PATH_OUTPUT` variable.

In [None]:
import os

from tqdm.auto import tqdm

PATH_OUTPUT = "output/STAR/full-generation"

path_txt = os.path.join(PATH_OUTPUT, "txt")
path_json = os.path.join(PATH_OUTPUT, "json")
os.makedirs(path_txt, exist_ok=True)
os.makedirs(path_json, exist_ok=True)

for dialog in tqdm(original_dialogs, desc="Dialog generation"):
    if os.path.exists(os.path.join(path_txt, f"{dialog.id}.txt")):
        continue

    scenario, description = STAR.get_dialog_scenario_description(dialog.id)
    dialog_generator = DialogGenerator(
        dialogue_details=description,
        scenario=scenario
    )
    dialog = dialog_generator.generate(id=dialog.id, seed=dialog.id)

    # Normalize speaker names in each turn (since they are also generated by the LLM)
    for turn in dialog.turns:
        turn.speaker = "System" if "AI" in turn.speaker else "User"

    dialog.to_file(os.path.join(path_json, f"{dialog.id}.json"))
    dialog.to_file(os.path.join(path_txt, f"{dialog.id}.txt"))
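
The speaker normalization in the loop above relies on the substring `"AI"` appearing in the generated label. Since the labels are written by the LLM, a slightly more defensive variant may be useful. The sketch below is only an illustration: `normalize_speaker` and `SYSTEM_WORDS` are hypothetical names, not part of `sdialog`, and the choice to treat a missing label as the user is an assumption. Unlike the substring check, it matches whole words case-insensitively.

```python
import re
from typing import Optional

# Hypothetical set of words that mark the assistant side of the dialogue
SYSTEM_WORDS = {"ai", "assistant", "system"}


def normalize_speaker(speaker: Optional[str]) -> str:
    """Map an LLM-generated speaker label to "System" or "User".

    Whole-word, case-insensitive match, so a name such as "Claire"
    is not mistaken for the assistant. A missing label (None) is
    treated as the user, which is an assumption of this sketch.
    """
    words = set(re.findall(r"[a-z]+", (speaker or "").lower()))
    return "System" if words & SYSTEM_WORDS else "User"


for label in ["AI Assistant", "AI assistant", "Ben", "Claire", None]:
    print(repr(label), "->", normalize_speaker(label))
```

Inside the loop, this would replace the conditional expression with `turn.speaker = normalize_speaker(turn.speaker)`.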

Finally, let's check the files were generated:

In [46]:
%ls output/STAR/full-generation/

[0m[01;34mjson[0m/
[01;34mtxt[0m/


In [None]:
%ls output/STAR/full-generation/txt

1.txt
1848.txt
1886.txt
1896.txt
1899.txt
1942.txt
1963.txt
1970.txt
2103.txt
2228.txt
2273.txt
2487.txt
2526.txt
2578.txt
2579.txt
2624.txt
2699.txt
2733.txt
3007.txt
3024.txt
3050.txt
3071.txt
3073.txt
3086.txt
3088.txt
3110.txt
3116.txt
3126.txt
3136.txt
3155.txt
3198.txt
3202.txt
3210.txt
3234.txt
3254.txt
3264.txt
3269.txt
3274.txt
3298.txt
3316.txt
3323.txt
3329.txt
3330.txt
3356.txt
3371.txt
3391.txt
3403.txt
3418.txt
3422.txt
3437.txt
3446.txt
3454.txt
3469.txt
3494.txt
3516.txt
3528.txt
3550.txt
3652.txt
3675.txt
3743.txt
3769.txt
4055.txt
4058.txt
4067.txt
4076.txt
4082.txt
4093.txt
4100.txt
4111.txt
4159.txt
4173.txt
4191.txt
4205.txt
4214.txt
4224.txt
4231.txt
4233.txt
4253.txt
4255.txt
4263.txt
4341.txt
4349.txt
4358.txt
4381.txt
4395.txt
4403.txt
4416.txt
4459.txt
4468.txt
4477.txt
4502.txt
4515.txt
4524.txt
4536.txt
4570.txt
4591.txt
4623.txt
4653.txt
4737.txt
4743.txt
4752.txt
4851.txt
4870.txt
4875.txt
9.txt
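
Rather than checking the listing by eye, we could compare the files on disk against the dialogue IDs we expected to generate. The following is a minimal sketch using only the standard library. `find_missing_ids` is a hypothetical helper introduced here for illustration, and the demo runs on a temporary directory rather than on the real output folder.

```python
import os
import tempfile


def find_missing_ids(expected_ids, directory, extension=".txt"):
    """Return the expected ids that have no matching file in `directory`."""
    present = {
        os.path.splitext(name)[0]
        for name in os.listdir(directory)
        if name.endswith(extension)
    }
    return [i for i in expected_ids if str(i) not in present]


# Small self-contained demo on a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    for dialog_id in (1, 9, 1848):
        with open(os.path.join(tmp, f"{dialog_id}.txt"), "w") as f:
            f.write("")
    print(find_missing_ids([1, 9, 1848, 1886], tmp))  # prints [1886]
```

In this notebook, the call would presumably look like `find_missing_ids([d.id for d in original_dialogs], path_txt)`, and an empty list would mean every dialogue was saved.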


## Exercise: Doctor-Patient Conversations

Suppose you now have to generate synthetic doctor-patient conversations. Is it better to use the "description" or the "role-playing" approach?

- How would you define the Patient and Doctor personas?

In [49]:
# TODO: do your magic!
class DoctorPersona(BasePersona):
    pass

class PatientPersona(BasePersona):
    pass

- How would you initialize them? (e.g. concrete values for each defined attribute)

In [None]:
doctor = DoctorPersona(
    # TODO: assign values to the attributes!
)
patient = PatientPersona(
    # TODO: assign values to the attributes!
)

Once we have defined our persona classes and created our doctor and patient personas, we can simply create a generator for them, as we did earlier in this tutorial:

In [None]:
# let's now create a dialogue generator for our doctor and patient
dialog_generator = PersonaDialogGenerator(
    persona_a=doctor,
    persona_b=patient
)

And generate as many dialogues as we want for them _(by running the cell multiple times)_:

In [None]:
# let's generate the doctor-patient dialogue
dialog_generator.generate().print()

- Can you think of a `scenario` object for doctor-patient conversations? What would this scenario contain?

> 💡 **Hint:** what do you want to keep track of or control when generating the conversations? (different diseases? different outcomes? different skills? etc.)

In [None]:
scenario = {}  # TODO: define your own, perhaps better to use pydantic's BaseModel instead of a dict {}
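
As one possible starting point for the exercise, the scenario could be a small typed record instead of a free-form dict. The sketch below uses a standard-library dataclass to stay dependency-free; pydantic's `BaseModel` would work similarly. The class name `DoctorPatientScenario`, its fields, and the example value are all hypothetical, chosen only to illustrate the kind of properties one might want to control.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DoctorPatientScenario:
    """Hypothetical scenario: the knobs we want to control per dialogue."""

    condition: str  # the disease or complaint being discussed
    visit_type: str = "first consultation"
    expected_outcome: str = "diagnosis and treatment plan"
    happy_path: bool = True  # whether the conversation goes smoothly


scenario = DoctorPatientScenario(condition="seasonal allergies")

# asdict() gives a plain dict, in case a dict is needed downstream
print(asdict(scenario))
```

Making the record frozen keeps a scenario from being modified by accident between generations, which helps when several dialogues are meant to share the same scenario.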

If we had such a `scenario`, we could define a function that returns the right doctor and patient personas for it:

In [None]:
def get_doctor_patient_for_scenario(scenario):
    # TODO: do some magic to return the right doctor
    #       and patient personas for the given scenario
    # doctor = DoctorPersona()
    # patient = PatientPersona()
    return doctor, patient
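
To make the idea more concrete, here is a rough sketch of how such a function might derive both personas from a scenario. It assumes the scenario is a plain dict with the hypothetical keys `condition` and `happy_path`, and it uses a stand-in `SimplePersona` dataclass instead of real `BasePersona` subclasses, so it shows the mapping logic only and not the actual `sdialog` API. The `role` and `circumstances` fields simply mirror the persona printout shown earlier.

```python
from dataclasses import dataclass


@dataclass
class SimplePersona:
    """Stand-in for a persona class; NOT sdialog's BasePersona."""

    role: str
    circumstances: str


def get_doctor_patient_for_scenario(scenario: dict):
    """Build a (doctor, patient) pair from a hypothetical scenario dict."""
    condition = scenario["condition"]
    happy = scenario.get("happy_path", True)

    doctor = SimplePersona(
        role="doctor",
        circumstances=f"Seeing a patient about {condition}.",
    )
    if happy:
        patient_circumstances = f"Worried about {condition}."
    else:
        patient_circumstances = (
            f"Worried about {condition} and reluctant to follow advice."
        )
    patient = SimplePersona(role="patient", circumstances=patient_circumstances)
    return doctor, patient


doctor, patient = get_doctor_patient_for_scenario({"condition": "a persistent cough"})
print(doctor)
print(patient)
```

In the exercise itself, `SimplePersona` would be replaced by your own `DoctorPersona` and `PatientPersona` classes, with whichever attributes you defined for them.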

Finally, we could use it to create generators for different scenarios:

In [None]:
def get_generator_for_scenario(scenario):
    # Get the right doctor and patient personas for the given scenario
    doctor, patient = get_doctor_patient_for_scenario(scenario)

    # Create and return a dialogue generator for them
    return PersonaDialogGenerator(
        persona_a=doctor,
        persona_b=patient,
        scenario=scenario,
        # dialogue_details=""  # TODO: optional, in case scenario also requires defining certain properties outside the personas
    )

In [None]:
generator = get_generator_for_scenario(scenario)

This allows us to generate multiple dialogues belonging to the same `scenario`:

In [None]:
generator.generate()

Cool, huh? Congrats on finishing the tutorial! You did a great job! 😎