<a href="https://colab.research.google.com/gist/johnnygreco/7dc7c56679ff405902296c3071b748f6/pokemon_story_generator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%%capture
%pip install -U git+https://github.com/gretelai/gretel-python-client

In [None]:
import pandas as pd

# Load our new Gretel object.
from gretel_client.navigator_client import Gretel

## 🐯 Pokemon Seed Dataset

- To demonstrate how to use your own seed data, we'll use a fun [Pokemon dataset](https://calmcode.io/datasets/pokemon-json).

In [None]:
data_seed_file = "https://gretel-datasets.s3.us-west-2.amazonaws.com/calmcode-datasets/pokemon_descriptive_columns.csv"
df_seed = pd.read_csv(data_seed_file)
df_seed.head()

## 🎬 Getting Started with AIDD 2.0

- Note the new DEV endpoint. You'll need your DEV API key.

- The new `DataDesigner` factory is accessed via the `data_designer` attribute.

In [None]:
gretel = Gretel(api_key="prompt", endpoint='https://api.dev.gretel.ai')

aidd = gretel.data_designer.new(model_suite="apache-2.0")

aidd

## 🌱 Injecting seed data into AIDD

- Use the `with_seed_dataset` method to upload your dataset to our platform.

- You now have access to all the fields in the dataset's schema, which you can use in prompt templates.

- Use the `sampling_strategy` argument to control how the data is sampled.

  - `ordered` (default): maintains the order of the seed dataset.
  
  - `shuffle`: randomly shuffles the seed dataset.

- If you plan to generate more records than the seed dataset contains, use `with_replacement=True`.

In [None]:
# The file_id is the unique identifier for the dataset we just uploaded.
# You can use it to reference this dataset in future API calls.
aidd.with_seed_dataset(df_seed, sampling_strategy="shuffle", with_replacement=True)
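
To build intuition for how `sampling_strategy` and `with_replacement` interact, here is a small local sketch using only the standard library. It is not Gretel's implementation: the function name `sample_seed_rows` and the wrap-around behavior for `ordered` sampling are assumptions made for illustration.

```python
import random


def sample_seed_rows(rows, num_records, sampling_strategy="ordered",
                     with_replacement=False, seed=None):
    """Hypothetical local illustration of seed-row sampling."""
    rng = random.Random(seed)
    # Without replacement, we cannot produce more records than seed rows.
    if not with_replacement and num_records > len(rows):
        raise ValueError(
            "num_records exceeds the seed size; set with_replacement=True"
        )
    if sampling_strategy == "ordered":
        # Assumption: walk the rows in order and wrap around when exhausted.
        return [rows[i % len(rows)] for i in range(num_records)]
    if sampling_strategy == "shuffle":
        if with_replacement:
            # Independent random draws; rows may repeat.
            return rng.choices(rows, k=num_records)
        # A random subset in random order; no row repeats.
        return rng.sample(rows, k=num_records)
    raise ValueError(f"unknown sampling_strategy: {sampling_strategy!r}")


seed_rows = ["Bulbasaur", "Charmander", "Squirtle"]
print(sample_seed_rows(seed_rows, 5, "ordered", with_replacement=True))
print(sample_seed_rows(seed_rows, 5, "shuffle", with_replacement=True, seed=0))
```

Asking for more records than the seed contains without `with_replacement=True` raises an error in this sketch, which mirrors the advice above.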

## 👩‍🚀 🎲 Person Samplers

- You can create reusable person samplers using the `with_person_samplers` method.

- Each person sampler you add will sample a different person _for each row_ of your dataset.

- You can choose the locale of the person sampler. We support all `Faker` [locales](https://faker.readthedocs.io/en/master/locales.html).

- For `locale=en_US`, we use our PGM to generate the person data. For all other locales, we use `Faker`, which means the data quality is _far_ lower than for `en_US`.

- **IMPORTANT:** The PGM doesn't work in streaming mode at the moment, so we are using a default locale of `en_GB` for our initial testing.

- Sampled persons have many attributes that you can use in your prompt templates. <br><br> A subset of the available attributes:
    
    - first_name
    - last_name
    - city
    - country
    - marital_status
    - education_level
    - bachelors_field
    - email_address

- The full sampled person objects will be dropped at the end of the generation process. If you want to keep them as structured objects, you can add them using `add_column` (see below) with `type="person"`.

- When creating person samplers, we currently support specification of the person's `sex`, `locale`, and `city`. Note the city must exist within the locale.

In [None]:
aidd.with_person_samplers(
    {
        "main_dude": {"sex": "Male"},
        "french_woman": {"sex": "Female", "locale": "fr_FR"},
        "random_bad_person": {}
    }
)

## 🧱 Building your data schema

- Add columns to your dataset's schema using the `add_column` method.

- Important arguments:

  - `name`: name of the column.
  
  - `type`: the column type, which determines the task that will generate the data. Available types:

    - **llm-generated**: Will use an LLM to generate the data. This is the same as our previous `add_generated_data_column` method.

    - There are many **sampling-based types**: `category`, `subcategory`, `uuid`, `uniform`, `gaussian`, `poisson`, `bernoulli`, `binomial`, `datetime`, `timedelta` <br> (documentation on all the configuration options for these types is coming soon).

    - Two special sampling-based types are `person` (sample persons using either our PGM or Faker) and `expression` (mathematical expressions involving other columns).

- Constraints can be applied to numerical samplers using the `add_constraint` method.

<br>

> **Note**: When calling `add_column`, llm-generated columns require a `prompt` and sampling-based columns require `params` for the sampler.
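
How the service enforces a constraint is not described here. One common way to apply a scalar inequality to a numerical sampler is rejection sampling: draw, check, and redraw until the value passes. The sketch below shows that idea with the standard library; `sample_with_constraint` and `OPERATORS` are hypothetical names, not part of the Gretel client.

```python
import operator
import random

# Map the operator strings used in a scalar inequality to Python comparisons.
OPERATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def sample_with_constraint(draw, op, rhs, max_tries=10_000):
    """Redraw from `draw()` until `value <op> rhs` holds (rejection sampling)."""
    compare = OPERATORS[op]
    for _ in range(max_tries):
        value = draw()
        if compare(value, rhs):
            return value
    raise RuntimeError("constraint not satisfied within max_tries")


rng = random.Random(0)
# Gaussian with mean 5 and standard deviation 5, restricted to values > 0.
years = [
    sample_with_constraint(lambda: rng.gauss(5, 5), ">", 0)
    for _ in range(1000)
]
print(all(y > 0 for y in years))
```

Rejection sampling keeps the shape of the distribution inside the allowed region, but it gets slow when the constraint excludes most of the probability mass, which is why the sketch caps the number of attempts.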

In [None]:
# If you want to keep things compact, you can optionally use this chaining syntax:
(
    aidd
    # You can create columns based on attributes of person objects.
    .add_column(
        name="protagonist_first_name",
        type="expression",
        params={"expr": "main_dude.first_name"}
    )
    .add_column(
        name="protagonist_last_name",
        type="expression",
        params={"expr": "main_dude.last_name"}
    )
    # Categories are a great way to add variety to your data.
    .add_column(
        name="story_theme",
        type="category",
        params={
            "values": [
                "Quest-based narrative",
                "Adventure gone wrong",
                "Unexpected discovery",
                "Coming-of-age",
                "Redemption/second chance",
            ]
        }
    )
    # Subcategories let you create values associated with a parent category.
    .add_column(
        name="theme_details",
        type="subcategory",
        params={
            "category": "story_theme",
            "values": {
                "Quest-based narrative": [
                    "Rescue mission (saving Pokemon from danger)",
                    "Multi-stage journey (collecting items/clues leading to rare Pokemon)",
                    "Mythical pursuit (following legends to find rare Pokemon)",
                    "Personal goal (completing Pokedex, finding specific Pokemon)"
                ],
                "Adventure gone wrong": [
                    "Lost in the wilderness (trying to find way back to civilization)",
                    "Trapped in a cave (escaping from dangerous Pokemon)",
                    "Caught in a storm (finding shelter and food while waiting for rescue)",
                    "Shipwrecked (finding a way to signal for help)"
                ],
                "Unexpected discovery": [
                    "Hidden treasure (finding rare Pokemon in unexpected location)",
                    "Ancient ruins (exploring ancient Pokemon civilization)",
                    "Time travel (visiting past/future to find rare Pokemon)",
                    "Alien encounter (meeting Pokemon from another planet)"
                ],
                "Coming-of-age": [
                    "Rite of passage (proving worth to Pokemon tribe)",
                    "Mentorship (learning from wise Pokemon)",
                    "First significant catch (finding rare Pokemon for first time)",
                    "Epic battle (defeating powerful Pokemon)"
                ],
                "Redemption/second chance": [
                    "Rehabilitation (helping injured Pokemon recover)",
                    "Forgiveness (making amends with Pokemon after past mistake)",
                    "Second chance (finding rare Pokemon after failing first attempt)",
                    "Redemption (saving Pokemon from evil trainer)"
                ]
            }
        }
    )
    # We have numerical samplers for common distributions.
    # There's also a sampler for scipy, which lets you use any
    # distribution available in scipy.stats.
    .add_column(
        name="number_of_pokemon",
        type="poisson",
        params={"mean": 5}
    )
    .add_column(
        name="years_since_last_pokemon_sighting",
        type="gaussian",
        params={"mean": 5, "stddev": 5},
        convert_to="int"
    )
    # Constraints are currently supported for numerical samplers.
    .add_constraint(
        target_column="number_of_pokemon",
        type="scalar_inequality",
        params={"operator": ">=", "rhs": 1}
    )
    .add_constraint(
        target_column="years_since_last_pokemon_sighting",
        type="scalar_inequality",
        params={"operator": ">", "rhs": 0}
    )
)
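
The relationship between a `category` column and its `subcategory` column can be pictured as a two-step draw: pick a parent value, then pick one of the values listed under that parent. The following is a minimal local sketch of that idea, assuming uniform sampling at both steps; `sample_theme_row` is a hypothetical helper and the value lists are shortened from the ones above.

```python
import random

story_themes = ["Quest-based narrative", "Coming-of-age"]
theme_details = {
    "Quest-based narrative": ["Rescue mission", "Mythical pursuit"],
    "Coming-of-age": ["Mentorship", "Epic battle"],
}


def sample_theme_row(categories, subcategories, rng):
    """Draw a parent category, then a subcategory that belongs to it."""
    theme = rng.choice(categories)
    detail = rng.choice(subcategories[theme])
    return {"story_theme": theme, "theme_details": detail}


rng = random.Random(42)
rows = [sample_theme_row(story_themes, theme_details, rng) for _ in range(5)]
for row in rows:
    print(row)
```

Because the subcategory is drawn conditionally on the parent, every row pairs a theme with a detail that was listed under it.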

In [None]:
# Pull it all together in a prompt that combines the elements we've created.
# Note that attributes of person objects are accessed with dot notation,
# the same way as fields of structured outputs.
aidd.add_column(
    name="adventure_story",
    model_alias="judge",
    prompt="""\
Create an engaging short story about a Pokemon adventure with the following elements:

**Characters and Setting:**
- Protagonist: {{ protagonist_first_name }} {{ protagonist_last_name }}
- Supporting character: French woman named {{ french_woman.first_name }}
- Antagonist: {{ random_bad_person.first_name }} {{ random_bad_person.last_name }}
- Context: {{ years_since_last_pokemon_sighting }} years since the last Pokemon sighting in the region
- Number of Pokemon to include: {{ number_of_pokemon }}
- Featured Pokemon: {{ pokemon_name }} (Type: {{ pokemon_type }}, HP: {{ hit_points }}, Attack: {{ attack_points }})

**Story Framework:**
- Theme: {{ story_theme }}
- Thematic elements to incorporate: {{ theme_details }}
- Begin with {{ protagonist_first_name }} encountering or searching for {{ pokemon_name }}
- Feature a conflict involving {{ random_bad_person.first_name }}
- Show how {{ french_woman.first_name }}'s expertise helps the protagonist
- End with a resolution that reflects the main theme

Write in a vivid, concise style that balances action, dialogue, and description while capturing the wonder of the Pokemon world.
"""
)
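
The prompt above uses `{{ ... }}` placeholders, including dotted references such as `{{ french_woman.first_name }}`. The service renders these for each row; as a rough illustration of how a dotted reference resolves against row data, here is a standard-library sketch. The `render` function, the regular expression, and the example row are hypothetical, and a real template engine supports far more than this simple substitution.

```python
import re

# Matches placeholders such as {{ pokemon_name }} or {{ french_woman.first_name }}.
PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def render(template, row):
    """Replace each placeholder with the value found by walking the dotted path."""
    def resolve(match):
        value = row
        for part in match.group(1).split("."):
            value = value[part]  # descend one level per dotted segment
        return str(value)
    return PLACEHOLDER.sub(resolve, template)


# Made-up row: flat columns plus one nested person object.
row = {
    "protagonist_first_name": "Ash",
    "number_of_pokemon": 3,
    "french_woman": {"first_name": "Amélie"},
}
template = (
    "{{ protagonist_first_name }} meets {{ french_woman.first_name }} "
    "and {{ number_of_pokemon }} Pokemon."
)
print(render(template, row))  # Ash meets Amélie and 3 Pokemon.
```

A placeholder that names a missing column raises a `KeyError` in this sketch, which is a useful reminder to check that every name in the prompt matches a seed column, a person sampler, or a column added with `add_column`.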


## 👀 Preview your AIDD workflow

In [None]:
preview = aidd.preview()

In [None]:
preview.display_sample_record()

## 🆙 Scale up with a batch workflow

In [None]:
workflow_run = aidd.create(
    num_records=100,
    workflow_run_name="pokemon_story_generator",
    wait_for_completion=True
)

In [None]:
workflow_run.dataset.df