<a target="_parent" href="https://colab.research.google.com/github/gretelai/gretel-blueprints/blob/main/docs/notebooks/data-designer/data-designer-101/1-the-basics.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# 🎨 Data Designer 101: The Basics

In this notebook, we will demonstrate the basics of `DataDesigner` by generating a simple product review dataset.

<br>

### 💾 Install `gretel-client` and its dependencies

In [1]:
%%capture
%pip install git+https://github.com/gretelai/gretel-python-client

## ⚙️ Initialize Data Designer with a Model Suite

- `DataDesigner` uses "Model Suites" to group LLMs based on the permissiveness of their licenses.

- Example Model Suites include "apache-2.0" (fully permissive) and "llama-3.x" (Llama Community License Agreement).

In [None]:
from gretel_client.navigator_client import Gretel

# We import AIDD column and parameter types using this shorthand for convenience.
import gretel_client.data_designer.params as P
import gretel_client.data_designer.columns as C

# The Gretel object is the SDK's main entry point for interacting with Gretel's API.
gretel = Gretel(api_key="prompt")

In [2]:
# Initialize a new Data Designer instance using the `data_designer` factory.
aidd = gretel.data_designer.new(model_suite="apache-2.0")

## 🎲 Getting started with sampler columns


- Sampler columns generate synthetic data without calling an LLM.

- They are particularly useful for **steering the diversity** of the generated data, as we demonstrate below.
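
To make the idea of steering diversity concrete, here is a minimal sketch of what category and subcategory sampling amount to conceptually, using only Python's standard library. The helpers `sample_category` and `sample_subcategory` are hypothetical stand-ins written for illustration; they are not part of `gretel_client`, and the actual sampler implementation may differ.

```python
import random

# Hypothetical stand-ins for the category and subcategory samplers;
# these helpers are NOT part of gretel_client.
def sample_category(values, weights=None, rng=random):
    """Draw one value; weights are relative and need not sum to 1."""
    return rng.choices(values, weights=weights, k=1)[0]

def sample_subcategory(parent_value, mapping, rng=random):
    """Draw a subcategory from the list mapped to the parent category."""
    return rng.choice(mapping[parent_value])

categories = ["Electronics", "Books"]
subcategories = {
    "Electronics": ["Smartphones", "Laptops"],
    "Books": ["Fiction", "Textbooks"],
}

rng = random.Random(0)  # Fixed seed so the sketch is reproducible.
records = []
for _ in range(1000):
    # Relative weights of 3:1 make "Electronics" about three times as likely.
    category = sample_category(categories, weights=[3, 1], rng=rng)
    records.append(
        {
            "product_category": category,
            "product_subcategory": sample_subcategory(category, subcategories, rng=rng),
        }
    )

electronics_share = sum(r["product_category"] == "Electronics" for r in records) / len(records)
print(f"Share of Electronics: {electronics_share:.2f}")
```

Because the subcategory is always drawn from the list that belongs to the sampled category, every record stays internally consistent, and changing the weights shifts the mix of the dataset without touching any prompt.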


<br>

Let's start designing our product review dataset by adding product category, product subcategory, and target age range columns.

In [None]:
aidd.add_column(
    C.SamplerColumn(
        name="product_category", 
        type=P.SamplerType.CATEGORY,
        params=P.CategorySamplerParams(
            values=["Electronics", "Clothing", "Home & Kitchen", "Books", "Home Office"], 
        )
    )
)

aidd.add_column(
    C.SamplerColumn(
        name="product_subcategory",
        type=P.SamplerType.SUBCATEGORY,
        params=P.SubcategorySamplerParams(
            category="product_category",  
            values={
                "Electronics": ["Smartphones", "Laptops", "Headphones", "Cameras", "Accessories"],
                "Clothing": ["Men's Clothing", "Women's Clothing", "Winter Coats", "Activewear", "Accessories"],
                "Home & Kitchen": ["Appliances", "Cookware", "Furniture", "Decor", "Organization"],
                "Books": ["Fiction", "Non-Fiction", "Self-Help", "Textbooks", "Classics"],
                "Home Office": ["Desks", "Chairs", "Storage", "Office Supplies", "Lighting"]
            }
        )
    )
)

aidd.add_column(
    C.SamplerColumn(
        name="target_age_range",
        type=P.SamplerType.CATEGORY,
        params=P.CategorySamplerParams(
            values=["18-25", "25-35", "35-50", "50-65", "65+"]
        )
    )
)

# Optionally validate that the columns are configured correctly.
aidd.validate()


Next, let's add samplers to generate data related to the customer and their review.

In [None]:
# This column will sample synthetic person data based on statistics from the US Census.
aidd.add_column(
    C.SamplerColumn(
        name="customer",
        type=P.SamplerType.PERSON,
        params=P.PersonSamplerParams(age_range=[18, 70])
    )
)

aidd.add_column(
    C.SamplerColumn(
        name="number_of_stars",
        type=P.SamplerType.UNIFORM,
        params=P.UniformSamplerParams(low=1, high=5),
        convert_to="int"
    )
)

aidd.add_column(
    C.SamplerColumn(
        name="review_style",
        type=P.SamplerType.CATEGORY,
        params=P.CategorySamplerParams(
            values=["rambling", "brief", "detailed", "structured with bullet points"],
            weights=[1, 2, 2, 1]
        )
    )
)

aidd.validate()

## 🦜 LLM-generated columns

- The real power of `DataDesigner` comes from leveraging LLMs to generate text, code, and structured data.

- For our product review dataset, we will use LLM-generated text columns to generate product names and customer reviews.

- When prompting the LLM, we can use Jinja templating to reference other columns in the dataset.

- As we see below, fields of nested JSON columns can be accessed using dot notation.
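
The sketch below illustrates how column references with dot notation resolve against a single record. It is a simplified stand-in built on the standard library `re` module, not the Jinja engine itself: it only handles plain `{{ name }}` and dotted `{{ name.field }}` placeholders. The `render` helper and the record values are hypothetical and exist only for illustration.

```python
import re

# Simplified stand-in for Jinja variable substitution (NOT the Jinja engine):
# it only resolves `{{ name }}` and dotted `{{ name.field }}` references.
PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")

def render(template: str, record: dict) -> str:
    def resolve(match: re.Match) -> str:
        value = record
        # Walk the dotted path one key at a time, e.g. customer -> first_name.
        for key in match.group(1).split("."):
            value = value[key]
        return str(value)

    return PLACEHOLDER.sub(resolve, template)

# Hypothetical record: flat columns plus one nested "customer" column.
record = {
    "product_name": "AuroraBuds",
    "number_of_stars": 4,
    "customer": {"first_name": "Maya", "city": "Austin"},
}

prompt = render(
    "You are {{ customer.first_name }} from {{ customer.city }}. "
    "You gave {{ product_name }} {{ number_of_stars }} stars.",
    record,
)
print(prompt)
```

Each record gets its own rendered prompt, which is how values drawn by the sampler columns end up shaping the LLM-generated text.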

In [None]:
aidd.add_column(
    C.LLMTextColumn(
        name="product_name",
        prompt=(
            "Come up with a creative product name for a product in the '{{ product_category }}' category, focusing "
            "on products related to '{{ product_subcategory }}'. The target age range of the ideal customer is "
            "{{ target_age_range }} years old. Respond with only the product name, no other text."
        ),
        # This is optional, but it can be useful for controlling the behavior of the LLM. Do not include instructions
        # related to output formatting in the system prompt, as AIDD handles this based on the column type.
        system_prompt=(
            "You are a helpful assistant that generates product names. You respond with only the product name, "
            "no other text. You do NOT add quotes around the product name. "
        )
    )
)

aidd.add_column(
    C.LLMTextColumn(
        name="customer_review",
        prompt=(
            "You are a customer named {{ customer.first_name }} from {{ customer.city }}, {{ customer.state }}. "
            "You are {{ customer.age }} years old and recently purchased a product called {{ product_name }}. "
            "Write a review of this product, which you gave a rating of {{ number_of_stars }} stars. "
            "The style of the review should be '{{ review_style }}'. "
        ),
    )
)

aidd.validate()

## 👀 Preview the dataset

- Iteration is key to generating high-quality synthetic data.

- Use the `preview` method to generate 10 records for inspection.
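
When iterating, one simple check is whether the sampled columns show the mix you intended. The sketch below computes value shares for one column over a list of row dictionaries. The rows and the `value_shares` helper are hypothetical; with a pandas DataFrame, rows in this shape can be obtained via `DataFrame.to_dict("records")`.

```python
from collections import Counter

# Hypothetical preview rows, used only to illustrate the check.
rows = [
    {"review_style": "brief", "number_of_stars": 5},
    {"review_style": "detailed", "number_of_stars": 2},
    {"review_style": "brief", "number_of_stars": 4},
    {"review_style": "rambling", "number_of_stars": 1},
]

def value_shares(rows, column):
    """Return each distinct value's share of the rows for one column."""
    counts = Counter(row[column] for row in rows)
    return {value: count / len(rows) for value, count in counts.items()}

style_shares = value_shares(rows, "review_style")
print(style_shares)
```

With only 10 preview records the shares will be noisy, so treat them as a sanity check on the configuration rather than a measurement of the final distribution.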

In [None]:
preview = aidd.preview()

In [None]:
# The preview dataset is available as a pandas DataFrame.
preview.dataset.df.head()

In [None]:
# Run this cell multiple times to cycle through the 10 preview records.
preview.display_sample_record()

## 🧐 Adding an Evaluation Report

- `DataDesigner` offers an evaluation report for a quick look at the quality of the generated data.

- To add a report, which is generated at the end of a batch workflow, call the `with_evaluation_report` method.

In [None]:
aidd.with_evaluation_report()

## 🆙 Scale up!

- Once you are happy with the preview, scale up to a larger dataset by submitting a batch workflow.

- You can view the evaluation report by following the workflow link in the output of `create` below.

- Click the link to follow along with the generation process.

In [None]:
workflow_run = aidd.create(num_records=100, name="aidd-101-notebook-1-product-reviews")

## ⏭️ Next Steps

Now that you've seen the basics of `DataDesigner`, check out the following notebooks to learn more about:


#### Advanced generation techniques

- [Structured outputs and Jinja expressions](https://github.com/gretelai/gretel-blueprints/blob/main/docs/notebooks/data-designer/data-designer-101/2-structured-outputs-and-jinja-expressions.ipynb)

- [Seeding synthetic data generation with an external dataset](https://github.com/gretelai/gretel-blueprints/blob/main/docs/notebooks/data-designer/data-designer-101/3-seeding-with-a-dataset.ipynb)

- [Using custom model configurations](https://github.com/gretelai/gretel-blueprints/blob/main/docs/notebooks/data-designer/data-designer-101/4-custom-model-configs.ipynb)


#### Real-world use cases

- [Text-to-Python](https://github.com/gretelai/gretel-blueprints/blob/main/docs/notebooks/data-designer/text-to-code/text-to-python.ipynb)

- [Text-to-SQL](https://github.com/gretelai/gretel-blueprints/blob/main/docs/notebooks/demo/navigator/text-to-code/text-to-sql.ipynb)

- [RAG Evaluation](https://github.com/gretelai/gretel-blueprints/blob/main/docs/notebooks/data-designer/rag-examples/generate-rag-evaluation-dataset.ipynb)